
CN110047468A - Speech recognition method, device and storage medium - Google Patents

Speech recognition method, device and storage medium

Info

Publication number
CN110047468A
Authority
CN
China
Prior art keywords
feature
intermediate features
speech recognition
fusion
obtains
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910418620.XA
Other languages
Chinese (zh)
Other versions
CN110047468B (en)
Inventor
曲贺
王晓瑞
李岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN201910418620.XA
Publication of CN110047468A
Application granted
Publication of CN110047468B
Legal status: Active
Anticipated expiration


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present disclosure relates to a speech recognition method, device and storage medium, and belongs to the field of machine learning. The method includes: obtaining an audio frame to be recognized; extracting a Mel-scale filter bank feature and a speaker information vector of the audio frame respectively; fusing the Mel-scale filter bank feature and the speaker information vector to obtain a fused feature; and processing the fused feature based on a target acoustic model to obtain a speech recognition result of the audio frame, the target acoustic model including multiple dilated convolutional layers. The present disclosure extracts both the Mel-scale filter bank feature and the speaker information vector of the audio frame, fuses the two, and inputs the fused feature into the acoustic model. Because the fused feature effectively expresses speaker characteristics and channel characteristics, the accuracy of speech recognition is improved. In addition, the acoustic model includes multiple dilated convolutional layers, which reduces the amount of computation for the same receptive field and speeds up speech recognition.

Description

Speech recognition method, device and storage medium
Technical field
The present disclosure relates to the field of machine learning, and in particular to a speech recognition method, device and storage medium.
Background
Speech recognition, also known as automatic speech recognition (Automatic Speech Recognition, ASR), is a technology that lets a machine convert a voice signal into corresponding text or commands through recognition and understanding. Speech recognition technology is now widely used in many fields such as industry, household appliances, communications, automotive electronics, medical care, home services and consumer electronics.
In speech recognition, accuracy and speed are of primary importance: the higher the accuracy and the faster the recognition, the more satisfied the user. How to recognize speech accurately and quickly, and thereby improve the recognition effect, has therefore become an urgent problem for those skilled in the art.
Summary of the invention
The present disclosure provides a speech recognition method, device and storage medium that can effectively improve the speech recognition effect.
According to a first aspect of the embodiments of the present disclosure, a speech recognition method is provided, comprising:
obtaining an audio frame to be recognized;
extracting a Mel-scale filter bank feature and a speaker information vector of the audio frame respectively;
fusing the Mel-scale filter bank feature and the speaker information vector to obtain a fused feature;
processing the fused feature based on a target acoustic model to obtain a speech recognition result of the audio frame, the target acoustic model comprising multiple dilated convolutional layers.
In one possible implementation, fusing the Mel-scale filter bank feature and the speaker information vector comprises:
standardizing the Mel-scale filter bank feature to obtain a first intermediate feature;
performing a dimension transformation on the speaker information vector to obtain a second intermediate feature, the dimension of the second intermediate feature being greater than the dimension of the speaker information vector;
standardizing the second intermediate feature to obtain a third intermediate feature;
fusing the first intermediate feature and the third intermediate feature to obtain the fused feature.
In one possible implementation, standardizing the Mel-scale filter bank feature to obtain the first intermediate feature comprises:
based on a first BatchNorm (batch normalization) layer, standardizing the Mel-scale filter bank feature to a mean of 0 and a variance of 1 to obtain the first intermediate feature;
and standardizing the second intermediate feature to obtain the third intermediate feature comprises:
based on a second BatchNorm layer, standardizing the second intermediate feature to a mean of 0 and a variance of 1 to obtain the third intermediate feature.
In one possible implementation, the target acoustic model comprises a dilated convolutional neural network and an LSTM (Long Short-Term Memory) network, the dilated convolutional neural network comprising the multiple dilated convolutional layers and the LSTM network comprising multiple LSTM layers;
processing the fused feature based on the target acoustic model to obtain the speech recognition result of the audio frame comprises:
inputting the fused feature into the dilated convolutional neural network and processing it successively through the multiple dilated convolutional layers, wherein the output of each dilated convolutional layer is the input of the next;
taking the first output result of the last dilated convolutional layer as the input of the LSTM network and processing it successively through the multiple LSTM layers, wherein the output of each LSTM layer is the input of the next;
determining the speech recognition result based on the second output result of the last LSTM layer.
In one possible implementation, fusing the first intermediate feature and the third intermediate feature to obtain the fused feature comprises:
performing a column-swap operation on the first intermediate feature and the third intermediate feature to obtain the fused feature; or,
performing a weighted transformation on the first intermediate feature and the third intermediate feature based on a weight matrix to obtain the fused feature.
According to a second aspect of the embodiments of the present disclosure, a speech recognition device is provided, comprising:
an acquiring unit configured to obtain an audio frame to be recognized;
an extraction unit configured to extract a Mel-scale filter bank feature and a speaker information vector of the audio frame respectively;
a fusion unit configured to fuse the Mel-scale filter bank feature and the speaker information vector to obtain a fused feature;
a processing unit configured to process the fused feature based on a target acoustic model to obtain a speech recognition result of the audio frame, the target acoustic model comprising multiple dilated convolutional layers.
In one possible implementation, the fusion unit comprises:
a first processing subunit configured to standardize the Mel-scale filter bank feature to obtain a first intermediate feature;
a second processing subunit configured to perform a dimension transformation on the speaker information vector to obtain a second intermediate feature, the dimension of the second intermediate feature being greater than the dimension of the speaker information vector;
a third processing subunit configured to standardize the second intermediate feature to obtain a third intermediate feature;
a fusion subunit configured to fuse the first intermediate feature and the third intermediate feature to obtain the fused feature.
In one possible implementation, the first processing subunit is further configured to standardize, based on a first BatchNorm layer, the Mel-scale filter bank feature to a mean of 0 and a variance of 1 to obtain the first intermediate feature;
the third processing subunit is configured to standardize, based on a second BatchNorm layer, the second intermediate feature to a mean of 0 and a variance of 1 to obtain the third intermediate feature.
In one possible implementation, the target acoustic model comprises a dilated convolutional neural network and an LSTM network, the dilated convolutional neural network comprising the multiple dilated convolutional layers and the LSTM network comprising multiple LSTM layers;
the processing unit is further configured to input the fused feature into the dilated convolutional neural network and process it successively through the multiple dilated convolutional layers, wherein the output of each dilated convolutional layer is the input of the next; take the first output result of the last dilated convolutional layer as the input of the LSTM network and process it successively through the multiple LSTM layers, wherein the output of each LSTM layer is the input of the next; and determine the speech recognition result based on the second output result of the last LSTM layer.
In one possible implementation, the fusion subunit is further configured to perform a column-swap operation on the first intermediate feature and the third intermediate feature to obtain the fused feature; or to perform a weighted transformation on the first intermediate feature and the third intermediate feature based on a weight matrix to obtain the fused feature.
According to a third aspect of the embodiments of the present disclosure, a speech recognition device is provided, comprising:
a processor; and
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the speech recognition method described in the first aspect above.
According to a fourth aspect of the embodiments of the present disclosure, a non-transitory computer-readable storage medium is provided. When the instructions in the storage medium are executed by a processor of a speech recognition device, the speech recognition device is enabled to perform the speech recognition method described in the first aspect above.
According to a fifth aspect of the embodiments of the present disclosure, an application program is provided. When the instructions in the application program are executed by a processor of a speech recognition device, the speech recognition device is enabled to perform the speech recognition method described in the first aspect above.
The technical solutions provided by the embodiments of the present disclosure may include the following beneficial effects:
During speech recognition, the embodiments of the present disclosure extract both the Mel-scale filter bank feature and the speaker information vector of an audio frame, fuse the two kinds of features, and input the fused feature into the acoustic model as the acoustic feature for recognition. Because the fused feature effectively expresses speaker characteristics and channel characteristics, this recognition scheme improves recognition accuracy. In addition, the acoustic model contains multiple dilated convolutional layers; exploiting the dilation, the amount of computation is effectively reduced for the same receptive field, which speeds up recognition. The speech recognition method provided by the embodiments of the present disclosure therefore effectively improves the speech recognition effect.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present disclosure.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present invention and, together with the description, serve to explain the principles of the present invention.
Fig. 1 is a schematic diagram of dilated convolution according to an exemplary embodiment.
Fig. 2 is a schematic structural diagram of the implementation environment involved in a speech recognition method according to an exemplary embodiment.
Fig. 3 is a flowchart of a speech recognition method according to an exemplary embodiment.
Fig. 4 is a flowchart of a speech recognition method according to an exemplary embodiment.
Fig. 5 is a schematic diagram of a multi-feature fusion process according to an exemplary embodiment.
Fig. 6 is a schematic structural diagram of an acoustic model according to an exemplary embodiment.
Fig. 7 is a block diagram of a speech recognition device according to an exemplary embodiment.
Fig. 8 is a block diagram of a speech recognition device according to an exemplary embodiment.
Fig. 9 is a block diagram of a speech recognition device according to an exemplary embodiment.
Fig. 10 is a block diagram of a speech recognition device according to an exemplary embodiment.
Detailed description
Exemplary embodiments will be described in detail here, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of devices and methods consistent with some aspects of the invention as detailed in the appended claims.
Before explaining the embodiments of the present disclosure in detail, some terms involved in the embodiments of the present disclosure are first explained.
Mel-scale filter bank feature: in the embodiments of the present disclosure, the Mel-scale filter bank feature refers to the FilterBank feature, also called the FBank feature. The FilterBank algorithm is a front-end processing algorithm that processes audio in a manner similar to the human ear, and can improve speech recognition performance.
Speaker information vector: in the embodiments of the present disclosure, the speaker information vector refers to the i-vector feature. The i-vector feature contains both speaker-difference information and channel-difference information; in other words, the i-vector feature effectively represents speaker characteristics and channel characteristics, i.e. it is used to characterize the speaker and the channel.
Dilated convolution (Dilated Convolution): also known as atrous convolution, a convolution that enlarges the receptive field. In a convolutional neural network, the receptive field is the region of the input layer that corresponds to a single element of the output of a given layer; expressed mathematically, the receptive field is the mapping from one element of a layer's output back onto the input layer.
Referring to Fig. 1, dilated convolution operations with dilation rates (dilated rate) of 1, 2 and 3 are illustrated. The left panel of Fig. 1 corresponds to a 1-dilated convolution with a 3x3 kernel, which is the same as an ordinary convolution. The middle panel corresponds to a 2-dilated convolution with a 3x3 kernel: the actual kernel size is still 3x3, but with one hole between taps, so that within a 7x7 region the convolution with the 3x3 kernel takes place only at the 9 black squares while the rest are skipped. Equivalently, the kernel can be viewed as 7x7 with only the 9 weights at the black squares being non-zero and all other weights being 0. As the middle panel shows, although the kernel size is only 3x3, the receptive field of this convolution has grown to 7x7. The right panel of Fig. 1 corresponds to a 3-dilated convolution with a 3x3 kernel.
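To make the dilation arithmetic above concrete, here is a minimal sketch (PyTorch is assumed; the patent does not prescribe a framework or any code). Note that a single 2-dilated 3x3 kernel by itself spans a 5x5 window; the 7x7 receptive field of the middle panel of Fig. 1 arises when the 2-dilated layer is stacked on top of the 1-dilated one.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)                          # (batch, channels, time, freq)

conv_d1 = nn.Conv2d(1, 1, kernel_size=3, dilation=1)   # ordinary 3x3 convolution
conv_d2 = nn.Conv2d(1, 1, kernel_size=3, dilation=2)   # same 9 weights, taps spaced 2 apart

# Window spanned by a k x k kernel at dilation d: k + (k - 1) * (d - 1)
for d in (1, 2, 3):
    span = 3 + (3 - 1) * (d - 1)
    print(f"dilation={d}: 9 weights cover a {span}x{span} window")

print(conv_d1(x).shape)   # torch.Size([1, 1, 30, 30])
print(conv_d2(x).shape)   # torch.Size([1, 1, 28, 28]) - wider context, no extra weights
```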
Batch normalization layer: in the embodiments of the present disclosure, the batch normalization layer refers to the BatchNorm layer. A BatchNorm layer transforms the distribution of its input, through a standardization step, to a standard normal distribution with a mean of 0 and a variance of 1.
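For reference, the mapping such a layer applies can be written as follows; this is the standard batch-normalization formula from the literature (the patent describes only its effect), where the statistics are computed over a mini-batch of m values and epsilon is a small constant for numerical stability:

```latex
\mu_{\mathcal{B}} = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_{\mathcal{B}}^{2} = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_{\mathcal{B}}\right)^{2}, \qquad
\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^{2} + \epsilon}}
```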
The implementation environment involved in the speech recognition method provided by the embodiments of the present disclosure is introduced below.
The speech recognition method provided by the embodiments of the present disclosure is applied to a speech recognition apparatus. Referring to Fig. 2, the speech recognition apparatus 201 is a computer device with machine learning capability; for example, it may be a fixed computer device such as a personal computer or a server, or a mobile computer device such as a tablet or a smartphone, which the embodiments of the present disclosure do not specifically limit. The speech recognition apparatus 201 includes a feature extraction module for front-end processing and an acoustic model for back-end processing.
As noted, accuracy and speed are of primary importance in speech recognition. In the related art, for front-end processing, MFCC (Mel Frequency Cepstrum Coefficient) features are usually extracted, but MFCC features cannot effectively represent speaker differences and channel differences. For back-end processing, the related art uses a CNN (Convolutional Neural Networks) as the acoustic model. A CNN is a feed-forward network, and a convolution kernel along the time dimension unfolds to the left and right in time, i.e. it depends on past audio frames and future audio frames; the more left and right frames it depends on, the larger the receptive field of the network. When multiple convolutional layers are stacked, more and more left and right frames must be relied upon, the amount of computation becomes enormous, and the speed of speech recognition drops.
To this end, the embodiments of the present disclosure provide a speech recognition method. On the one hand, the method performs multi-feature fusion: during feature extraction, the FBank feature and the i-vector feature are extracted simultaneously, and the fusion of the FBank feature and the i-vector feature serves as the input for speech recognition, improving recognition accuracy.
On the other hand, the method uses a new deep neural network as the acoustic model. This acoustic model includes a dilated convolutional neural network and an LSTM network, i.e. the acoustic model is a Dilated-CNN + LSTM. Compared with a plain CNN, it keeps fewer convolution kernels for the same receptive field, which effectively reduces the amount of computation and increases recognition speed.
In addition, for the same amount of computation, dilated convolution has a larger receptive field than a plain CNN and can therefore capture more information; setting recognition speed aside, this also improves recognition accuracy. Furthermore, the larger the dilation rate, the larger the receptive field and the more information captured, which again benefits the accuracy of speech recognition.
Fig. 3 is a flowchart of a speech recognition method according to an exemplary embodiment. As shown in Fig. 3, the method is used in the speech recognition apparatus shown in Fig. 2 and includes the following steps.
In step 301, an audio frame to be recognized is obtained.
In step 302, the Mel-scale filter bank feature and the speaker information vector of the audio frame are extracted respectively.
In step 303, the Mel-scale filter bank feature and the speaker information vector are fused to obtain a fused feature.
In step 304, the fused feature is processed based on a target acoustic model to obtain a speech recognition result of the audio frame, the target acoustic model comprising multiple dilated convolutional layers.
With the method provided by the embodiments of the present disclosure, both the Mel-scale filter bank feature and the speaker information vector of the audio frame are extracted during recognition; the two kinds of features are then fused, and the fused feature is input into the acoustic model as the acoustic feature for recognition. Because the fused feature effectively expresses speaker characteristics and channel characteristics, this recognition scheme improves recognition accuracy. In addition, the acoustic model contains multiple dilated convolutional layers, and the dilation effectively reduces the amount of computation for the same receptive field, speeding up recognition. The speech recognition method provided by the embodiments of the present disclosure therefore effectively improves the speech recognition effect.
In one possible implementation, fusing the Mel-scale filter bank feature and the speaker information vector comprises:
standardizing the Mel-scale filter bank feature to obtain a first intermediate feature;
performing a dimension transformation on the speaker information vector to obtain a second intermediate feature, the dimension of the second intermediate feature being greater than the dimension of the speaker information vector;
standardizing the second intermediate feature to obtain a third intermediate feature;
fusing the first intermediate feature and the third intermediate feature to obtain the fused feature.
In one possible implementation, standardizing the Mel-scale filter bank feature to obtain the first intermediate feature comprises:
based on a first BatchNorm layer, standardizing the Mel-scale filter bank feature to a mean of 0 and a variance of 1 to obtain the first intermediate feature;
and standardizing the second intermediate feature to obtain the third intermediate feature comprises:
based on a second BatchNorm layer, standardizing the second intermediate feature to a mean of 0 and a variance of 1 to obtain the third intermediate feature.
In one possible implementation, the target acoustic model comprises a dilated convolutional neural network and an LSTM network, the dilated convolutional neural network comprising the multiple dilated convolutional layers and the LSTM network comprising multiple LSTM layers;
processing the fused feature based on the target acoustic model to obtain the speech recognition result of the audio frame comprises:
inputting the fused feature into the dilated convolutional neural network and processing it successively through the multiple dilated convolutional layers, wherein the output of each dilated convolutional layer is the input of the next;
taking the first output result of the last dilated convolutional layer as the input of the LSTM network and processing it successively through the multiple LSTM layers, wherein the output of each LSTM layer is the input of the next;
determining the speech recognition result based on the second output result of the last LSTM layer.
In one possible implementation, fusing the first intermediate feature and the third intermediate feature to obtain the fused feature comprises:
performing a column-swap operation on the first intermediate feature and the third intermediate feature to obtain the fused feature; or,
performing a weighted transformation on the first intermediate feature and the third intermediate feature based on a weight matrix to obtain the fused feature.
All of the above optional technical solutions may be combined in any manner to form optional embodiments of the present disclosure, which will not be repeated here one by one.
It should be noted that designations such as first, second, third and fourth appearing in the following embodiments are used only to distinguish different objects and do not constitute any other specific limitation on those objects.
Fig. 4 is a flowchart of a speech recognition method according to an exemplary embodiment. As shown in Fig. 4, the method is used in the speech recognition apparatus shown in Fig. 2 and includes the following steps.
In step 401, an audio frame to be recognized is obtained.
An audio frame generally refers to a short segment of audio of fixed length. As an example, in speech recognition the frame length is usually set to 10 to 30 ms (milliseconds), i.e. the playback duration of one audio frame is 10 to 30 ms, so that a frame contains enough periods without changing too drastically. In the embodiments of the present disclosure, the playback duration of one audio frame is 25 ms, i.e. the frame length is 25 ms, and the frame shift is 10 ms.
In one possible implementation, before performing feature extraction, the speech recognition apparatus usually pre-processes the speaker's voice, where the pre-processing includes but is not limited to framing, pre-emphasis, windowing, noise reduction, and the like.
In addition, the speaker's voice may be collected by a voice acquisition device configured on the speech recognition apparatus, or may be sent to the speech recognition apparatus by another device; the embodiments of the present disclosure do not specifically limit this.
As an example, the embodiments of the present disclosure may perform speech recognition on a single piece of audio frame by frame, or perform speech recognition on multiple pieces of audio; the embodiments of the present disclosure likewise do not specifically limit this.
In step 402, the FBank feature and the i-vector feature of the audio frame to be recognized are extracted respectively.
Extracting the FBank feature
FBank feature extraction is performed after pre-processing, at which point the speaker's voice has already been framed into individual audio frames; that is, the FBank feature is extracted frame by frame. Since framing still yields a time-domain signal, the time-domain signal must first be converted into a frequency-domain signal in order to extract the FBank feature.
In one possible implementation, the signal can be transformed from the time domain to the frequency domain by a Fourier transform. Fourier transforms can further be divided into the continuous Fourier transform and the discrete Fourier transform; since the audio frame is digital rather than analog audio, the embodiments of the present disclosure use the discrete Fourier transform to extract the FBank feature. As an example, the FFT (Fast Fourier Transformation, Fast Fourier Transform) is usually used to extract the FBank feature frame by frame.
In one possible implementation, for frame-by-frame speech recognition, the dimension of the FBank feature may be 40; the embodiments of the present disclosure do not specifically limit this.
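A minimal sketch of this extraction pipeline, under stated assumptions: 16 kHz audio, the librosa library and the placeholder file name "speech.wav" are choices of this illustration, not of the patent; the 25 ms frame length, 10 ms frame shift and 40 filters match the embodiment above.

```python
import librosa
import numpy as np

wav, sr = librosa.load("speech.wav", sr=16000)   # "speech.wav" is a placeholder

mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=512,
    win_length=int(0.025 * sr),   # 400 samples = 25 ms frame length
    hop_length=int(0.010 * sr),   # 160 samples = 10 ms frame shift
    n_mels=40)                    # 40 Mel filters -> 40-dim FBank feature

log_fbank = np.log(mel + 1e-6).T  # (num_frames, 40): one FBank vector per frame
```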
Extracting the i-vector feature
The JFA (Joint Factor Analysis) method models speaker differences and channel differences separately in the subspaces of the GMM (Gaussian Mixture Model) supervector space, in order to separate out channel interference. However, in the JFA model the channel factors also carry part of the speaker information, so compensation loses some speaker information. On this basis, the total variability space model was proposed, which models speaker differences and channel differences together as a whole. This method alleviates JFA's heavy demands on training corpora and its high computational complexity, while delivering performance comparable to JFA.
Given a segment of speech from a speaker, the corresponding Gaussian mean supervector can be defined as:
M = m + Tw
where M is the Gaussian mean supervector of the given speech; m is the Gaussian mean supervector of the UBM (Universal Background Model), which is independent of the specific speaker and channel; T is the total variability space matrix, which is of low rank; and w is the total variability factor, whose posterior mean is the i-vector feature and whose prior is the standard normal distribution.
In the above formula, M and m can be computed, while the total variability space matrix T and the total variability factor w must be estimated. When estimating the total variability space matrix T, all given speech segments are assumed to come from different speakers; even multiple segments from the same speaker are treated as coming from different people. The i-vector feature is defined as the maximum a posteriori point estimate of the total variability factor w, i.e. the posterior mean of w. In one possible implementation, after the total variability space matrix T has been estimated, the zeroth- and first-order Baum-Welch statistics are extracted from the given speaker's speech, from which the estimate of the i-vector feature can be computed.
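For completeness, the closed form that is standard in the i-vector literature (it is not spelled out in the patent): with N(u) the zeroth-order Baum-Welch statistics of utterance u arranged as a block-diagonal matrix, \tilde{F}(u) the centred first-order statistics supervector, and \Sigma the UBM covariance, the i-vector is the posterior mean of w:

```latex
\bar{w}(u) = \left( I + T^{\top} \Sigma^{-1} N(u)\, T \right)^{-1} T^{\top} \Sigma^{-1} \tilde{F}(u)
```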
In one possible implementation, for frame-by-frame speech recognition, the dimension of the i-vector feature may be 100; the embodiments of the present disclosure do not specifically limit this.
In step 403, the extracted FBank feature and i-vector feature are fused to obtain a fused feature.
In one possible implementation, fusing the extracted FBank feature and i-vector feature comprises the following steps:
4031. Standardize the FBank feature to obtain a first intermediate feature.
Referring to Fig. 5, which illustrates the multi-feature fusion process: the FBank feature can first be standardized by a BatchNorm layer. For ease of distinction, the embodiments of the present disclosure call this BatchNorm layer the first BatchNorm layer.
In one possible implementation, standardizing the FBank feature to obtain the first intermediate feature includes but is not limited to: based on the first BatchNorm layer, standardizing the FBank feature to a mean of 0 and a variance of 1 to obtain the first intermediate feature.
As shown in Fig. 5, the dimension of the FBank feature does not change after the BatchNorm layer and remains 40. In addition, for ease of reference, the FBank feature after the BatchNorm layer is called the first intermediate feature herein.
4032. Perform a dimension transformation on the i-vector feature to obtain a second intermediate feature.
As shown in Fig. 5, before the i-vector feature, which carries speaker characteristics and channel characteristics, is standardized, a linear mapping (linear) layer can first be applied to transform its dimension. In one possible implementation, the dimension transformation raises the dimension, i.e. the dimension of the second intermediate feature is greater than that of the initial i-vector feature. For example, the i-vector feature is 100-dimensional before the linear layer, and the linear layer maps the 100-dimensional i-vector feature to 200 dimensions.
The i-vector feature after the linear layer is also called the second intermediate feature herein.
4033. Standardize the second intermediate feature to obtain a third intermediate feature.
Referring to Fig. 5, after the linear layer the i-vector feature also passes through a BatchNorm layer; for ease of distinction, the embodiments of the present disclosure call this BatchNorm layer the second BatchNorm layer.
Similarly, standardizing the second intermediate feature to obtain the third intermediate feature includes but is not limited to: based on the second BatchNorm layer, standardizing the second intermediate feature to a mean of 0 and a variance of 1 to obtain the third intermediate feature.
As shown in Fig. 5, the dimension of the i-vector feature does not change after the linear layer and the subsequent BatchNorm layer and remains 200. In addition, for ease of reference, the second intermediate feature after the BatchNorm layer is also called the third intermediate feature herein.
4034. Fuse the first intermediate feature and the third intermediate feature to obtain the fused feature.
Next, referring to Fig. 5, the FBank feature and the i-vector feature, after their respective BatchNorm layers, are input together into a fusion (combine) layer for fusion.
The combine layer may be either a linear mapping layer or a column-swap layer; the embodiments of the present disclosure do not specifically limit this. In one possible implementation, the parameters of the combine layer can be randomly initialized and optimized by back-propagation, for example with the stochastic gradient descent algorithm; the embodiments of the present disclosure do not specifically limit this either.
As an example, fusing the first intermediate feature and the third intermediate feature to obtain the fused feature includes but is not limited to the following two approaches:
First, when the combine layer is a column-swap layer, a column-swap operation is performed on the first intermediate feature and the third intermediate feature to obtain the fused feature. The dimension of the fused feature is consistent with the combined dimensions of the first intermediate feature and the third intermediate feature; as shown in Fig. 5, when the first intermediate feature is 40-dimensional and the third intermediate feature is 200-dimensional, the dimension of the fused feature is 40*6 (that is, 240 = 40 + 200). As an example, the column-swap operation exchanges columns of the features, for example swapping the first column with the second column, or the first column with the last column; the embodiments of the present disclosure do not specifically limit this.
Second, when the combine layer is a linear layer, a weighted transformation is applied to the first intermediate feature and the third intermediate feature based on a weight matrix to obtain the fused feature. Fusing through the linear layer is equivalent to multiplying the first intermediate feature and the third intermediate feature by a weight. The weight matrix can be randomly initialized and trained jointly with the acoustic model; the embodiments of the present disclosure do not specifically limit this.
Fusing the FBank feature and the i-vector feature after their two BatchNorm layers as described above enhances the robustness of the feature.
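A minimal sketch of this fusion path (PyTorch is assumed; fusion by concatenation is one consistent reading of the 240-dimensional fused feature in Fig. 5, and the patent equally allows a column-swap layer or a learned weighted transform in its place):

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    def __init__(self, fbank_dim=40, ivec_dim=100, ivec_out=200):
        super().__init__()
        self.bn_fbank = nn.BatchNorm1d(fbank_dim)    # first BatchNorm layer
        self.linear = nn.Linear(ivec_dim, ivec_out)  # dimension-raising linear layer
        self.bn_ivec = nn.BatchNorm1d(ivec_out)      # second BatchNorm layer

    def forward(self, fbank, ivec):
        f1 = self.bn_fbank(fbank)                    # first intermediate feature
        f3 = self.bn_ivec(self.linear(ivec))         # third intermediate feature
        return torch.cat([f1, f3], dim=-1)           # fused feature, 40 + 200 = 240 dims

fusion = FeatureFusion()
fused = fusion(torch.randn(8, 40), torch.randn(8, 100))
print(fused.shape)  # torch.Size([8, 240])
```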
In step 404, the fused feature is processed based on a target acoustic model to obtain a speech recognition result of the audio frame to be recognized, where the target acoustic model comprises a dilated convolutional neural network and an LSTM network, the dilated convolutional neural network comprising multiple dilated convolutional layers and the LSTM network comprising multiple LSTM layers.
In the embodiments of the present disclosure, the output of the combine layer in step 403 above is the input of the target acoustic model.
In one possible implementation, referring to Fig. 6, the dilated convolutional neural network comprises 6 dilated convolutional layers in total, referred to as dilated convolutional layer 0 through dilated convolutional layer 5. The dilated convolutions are 2-D convolutions with kernel size M*N, where M represents the size of the time-domain convolution and N the size of the frequency-domain convolution. As an example, the kernels of the 6 dilated convolutional layers, from layer 0 to layer 5, are as follows:
for dilated convolutional layer 0, the kernel size is 7*3 and the number of kernels is 64; for layer 1, the kernel size is 5*3 and the number of kernels is 64; for layer 2, the kernel size is 3*3 and the number of kernels is 128; for layer 3, the kernel size is 3*3 and the number of kernels is 128; for layer 4, the kernel size is 3*3 and the number of kernels is 256; and for layer 5, the kernel size is 3*3 and the number of kernels is 256.
The LSTM network is an RNN (Recurrent Neural Networks) structure that is widely used in acoustic models today. Compared with a plain RNN, the LSTM controls the storage, input and output of information through carefully designed gates, and can to some extent avoid the vanishing-gradient problem of plain RNNs, so that the LSTM network can effectively model long-term correlations in time signals.
In one possible implementation, the LSTM network in the acoustic model generally comprises 3 to 5 LSTM layers, because directly stacking more LSTM layers to build a deeper network tends not to improve performance and can even make the model worse. As an example, referring to Fig. 6, the LSTM network comprises 3 LSTM layers.
What essentially distinguishes the LSTM from the RNN is that the LSTM adds to the algorithm a "processor" structure that judges whether information is useful; this "processor" is called a cell, i.e. the LSTM cell. Three gates are placed in one LSTM cell: an input gate, a forget gate and an output gate.
In one possible implementation, when training the target acoustic model, a dictionary can serve as the training corpus, a model with the architecture shown in Fig. 6 serves as the initial model, monophones, multiphones (senones), letters, words or Chinese characters serve as the training targets, and the model is optimized with the stochastic gradient descent algorithm; the embodiments of the present disclosure do not specifically limit this.
In the embodiments of the present disclosure, processing the fused feature based on the target acoustic model to obtain the speech recognition result of the audio frame comprises the following steps:
4041. Input the fused feature into the dilated convolutional neural network and process it successively through the multiple dilated convolutional layers, where the output of each dilated convolutional layer is the input of the next. As shown in Fig. 6, dilated convolution operations are performed on the fused feature successively by dilated convolutional layer 0 through dilated convolutional layer 5.
4042. Take the first output result of the last dilated convolutional layer as the input of the LSTM network and process it successively through the multiple LSTM layers, where the output of each LSTM layer is the input of the next.
Herein, to distinguish the outputs of the dilated convolutional neural network and the LSTM network, the output result of the dilated convolutional neural network is called the first output result, and the output result of the LSTM network is called the second output result. As shown in Fig. 6, the output result of the dilated convolutional neural network is processed successively by LSTM layer 0, LSTM layer 1 and LSTM layer 2.
4043. Determine the speech recognition result of the audio to be recognized based on the second output result of the last LSTM layer.
Inputting the fused feature into the stacked dilated convolutional neural network allows more abstract features to be learned effectively; the first output result is then fed into the stacked LSTM network, and finally the output layer outputs the phonetic (acoustic) class corresponding to the current audio frame to be recognized.
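A minimal sketch of the Dilated-CNN + LSTM forward pass described in steps 4041 to 4043, under stated assumptions: the patent lists only kernel sizes and kernel counts, so the dilation rate of 2, the "same" padding, the frequency pooling, the LSTM hidden size of 512 and the senone output layer below are all illustrative guesses rather than the patented configuration.

```python
import torch
import torch.nn as nn

class DilatedCNNLSTM(nn.Module):
    def __init__(self, num_classes=9000):  # number of senone classes is an assumption
        super().__init__()
        # Six dilated conv layers, (in_channels, out_channels, kernel) per Fig. 6.
        specs = [(1, 64, (7, 3)), (64, 64, (5, 3)), (64, 128, (3, 3)),
                 (128, 128, (3, 3)), (128, 256, (3, 3)), (256, 256, (3, 3))]
        self.convs = nn.ModuleList([
            nn.Conv2d(cin, cout, k, dilation=2, padding="same")
            for cin, cout, k in specs])
        self.lstm = nn.LSTM(input_size=256, hidden_size=512,
                            num_layers=3, batch_first=True)  # three LSTM layers
        self.out = nn.Linear(512, num_classes)

    def forward(self, x):                  # x: (batch, 1, time, feature_dim)
        for conv in self.convs:            # each layer's output feeds the next
            x = torch.relu(conv(x))
        x = x.mean(dim=3).transpose(1, 2)  # pool frequency axis -> (batch, time, 256)
        h, _ = self.lstm(x)                # second output result: last LSTM layer
        return self.out(h)                 # per-frame phonetic class scores

model = DilatedCNNLSTM()
scores = model(torch.randn(2, 1, 100, 240))  # 100 frames of 240-dim fused features
print(scores.shape)                          # torch.Size([2, 100, 9000])
```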
The phonetic class may be a multiphone unit (senone), a phoneme (phone), a letter, a Chinese character or a word.
The above implementation determines, based on the acoustic model, the phonetic class of the acoustic feature extracted from the audio frame. In one possible implementation, a language model and decoding techniques can further be used to convert this into text that the user can understand; the embodiments of the present disclosure do not specifically limit this.
With the method provided by the embodiments of the present disclosure, the FBank feature and the i-vector feature of the audio frame are extracted simultaneously during speech recognition; the two features are then fused, and the fused feature is input into the acoustic model as the acoustic feature for recognition. Because the fused feature effectively expresses speaker characteristics and channel characteristics, recognition accuracy is improved. In addition, the acoustic model adopts a dilated convolutional neural network; exploiting the dilation, the amount of computation is effectively reduced for the same receptive field, so that recognition is faster than with a plain CNN.
In summary, the speech recognition method provided by the embodiments of the present disclosure achieves a good speech recognition effect.
Fig. 7 is a block diagram of a speech recognition device according to an exemplary embodiment. Referring to Fig. 7, the device comprises an acquiring unit 701, an extraction unit 702, a fusion unit 703 and a processing unit 704.
The acquiring unit 701 is configured to obtain an audio frame to be recognized;
the extraction unit 702 is configured to extract the Mel-scale filter bank feature and the speaker information vector of the audio frame respectively;
the fusion unit 703 is configured to fuse the Mel-scale filter bank feature and the speaker information vector to obtain a fused feature;
the processing unit 704 is configured to process the fused feature based on a target acoustic model to obtain a speech recognition result of the audio frame, the target acoustic model comprising multiple dilated convolutional layers.
With the device provided by the embodiments of the present disclosure, both the Mel-scale filter bank feature and the speaker information vector of the audio frame are extracted during recognition; the two kinds of features are then fused, and the fused feature is input into the acoustic model as the acoustic feature for recognition. Because the fused feature effectively expresses speaker characteristics and channel characteristics, this recognition scheme improves recognition accuracy. In addition, the acoustic model contains multiple dilated convolutional layers, and the dilation effectively reduces the amount of computation for the same receptive field, speeding up recognition; that is, the speech recognition provided by the embodiments of the present disclosure effectively improves the speech recognition effect.
In one possible implementation, referring to Fig. 8, the fusion unit 703 comprises:
a first processing subunit 7031 configured to standardize the Mel-scale filter bank feature to obtain a first intermediate feature;
a second processing subunit 7032 configured to perform a dimension transformation on the speaker information vector to obtain a second intermediate feature, the dimension of the second intermediate feature being greater than the dimension of the speaker information vector;
a third processing subunit 7033 configured to standardize the second intermediate feature to obtain a third intermediate feature;
a fusion subunit 7034 configured to fuse the first intermediate feature and the third intermediate feature to obtain the fused feature.
In one possible implementation, the first processing subunit 7031 is further configured to standardize, based on a first BatchNorm layer, the Mel-scale filter bank feature to a mean of 0 and a variance of 1 to obtain the first intermediate feature;
the third processing subunit 7033 is configured to standardize, based on a second BatchNorm layer, the second intermediate feature to a mean of 0 and a variance of 1 to obtain the third intermediate feature.
In one possible implementation, the target acoustic model comprises a dilated convolutional neural network and an LSTM network, the dilated convolutional neural network comprising the multiple dilated convolutional layers and the LSTM network comprising multiple LSTM layers;
the processing unit 704 is further configured to input the fused feature into the dilated convolutional neural network and process it successively through the multiple dilated convolutional layers, wherein the output of each dilated convolutional layer is the input of the next; take the first output result of the last dilated convolutional layer as the input of the LSTM network and process it successively through the multiple LSTM layers, wherein the output of each LSTM layer is the input of the next; and determine the speech recognition result based on the second output result of the last LSTM layer.
In one possible implementation, the fusion subunit 7034 is further configured to perform a column-swap operation on the first intermediate feature and the third intermediate feature to obtain the fused feature; or to perform a weighted transformation on the first intermediate feature and the third intermediate feature based on a weight matrix to obtain the fused feature.
All of the above optional technical solutions may be combined in any manner to form optional embodiments of the present disclosure, which will not be repeated here one by one.
With regard to the device in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the related method embodiments and will not be elaborated here.
Fig. 9 is a structural block diagram of a speech recognition device provided by an embodiment of the present disclosure; the device may be a server. The device 900 may vary considerably with configuration or performance, and may include one or more processors (central processing units, CPU) 901 and one or more memories 902, where at least one instruction is stored in the memory 902 and is loaded and executed by the processor 901 to implement the speech recognition method provided by each of the above method embodiments. Of course, the device may also have components such as a wired or wireless network interface, a keyboard and an input/output interface for input and output, and may also include other components for implementing device functions, which will not be repeated here.
In an exemplary embodiment, a computer-readable storage medium is also provided, for example a memory including instructions that can be executed by the processor of a terminal to complete the speech recognition method in the above embodiments. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 10 shows a structural block diagram of a device 1000 provided by an exemplary embodiment of the present disclosure. The device 1000 may be a mobile terminal.
In general, the device 1000 includes a processor 1001 and a memory 1002.
The processor 1001 may include one or more processing cores, for example a 4-core or 8-core processor. The processor 1001 may be implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array) and PLA (Programmable Logic Array). The processor 1001 may also include a main processor and a coprocessor: the main processor handles data in the awake state and is also called the CPU (Central Processing Unit); the coprocessor is a low-power processor that handles data in the standby state. In some embodiments, the processor 1001 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that the display screen needs to show. In some embodiments, the processor 1001 may also include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
The memory 1002 may include one or more computer-readable storage media, which may be non-transient. The memory 1002 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash storage devices. In some embodiments, the non-transient computer-readable storage medium in the memory 1002 is used to store at least one instruction, which is executed by the processor 1001 to implement the speech recognition method provided by the method embodiments of the present disclosure.
In some embodiments, the device 1000 optionally further includes a peripheral device interface 1003 and at least one peripheral device. The processor 1001, the memory 1002 and the peripheral device interface 1003 may be connected by buses or signal wires, and each peripheral device may be connected to the peripheral device interface 1003 by a bus, a signal wire or a circuit board. Specifically, the peripheral devices include at least one of a radio-frequency circuit 1004, a touch display screen 1005, a camera 1006, an audio circuit 1007, a positioning component 1008 and a power supply 1009.
The peripheral device interface 1003 may be used to connect at least one I/O (Input/Output)-related peripheral device to the processor 1001 and the memory 1002. In some embodiments, the processor 1001, the memory 1002 and the peripheral device interface 1003 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1001, the memory 1002 and the peripheral device interface 1003 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio-frequency circuit 1004 is used to receive and transmit RF (Radio Frequency) signals, also called electromagnetic signals. The radio-frequency circuit 1004 communicates with communication networks and other communication devices through electromagnetic signals; it converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals. Optionally, the radio-frequency circuit 1004 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and the like. The radio-frequency circuit 1004 may communicate with other terminals through at least one wireless communication protocol, including but not limited to the World Wide Web, metropolitan area networks, intranets, the various generations of mobile communication networks (2G, 3G, 4G and 5G), wireless local area networks and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio-frequency circuit 1004 may also include NFC (Near Field Communication)-related circuits, which is not limited in the present disclosure.
The display screen 1005 is used to display a UI (User Interface). The UI may include graphics, text, icons, videos, and any combination thereof. When the display screen 1005 is a touch display screen, the display screen 1005 also has the ability to acquire touch signals on or above the surface of the display screen 1005. The touch signal may be input to the processor 1001 as a control signal for processing. In this case, the display screen 1005 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 1005, arranged on the front panel of the device 1000; in other embodiments, there may be at least two display screens 1005, arranged respectively on different surfaces of the device 1000 or in a folding design; in still other embodiments, the display screen 1005 may be a flexible display screen arranged on a curved surface or a folding surface of the device 1000. The display screen 1005 may even be set in a non-rectangular irregular shape, namely a shaped screen. The display screen 1005 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
The camera assembly 1006 is used to capture images or videos. Optionally, the camera assembly 1006 includes a front camera and a rear camera. Generally, the front camera is arranged on the front panel of the terminal, and the rear camera is arranged on the back of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so as to realize a background blurring function through fusion of the main camera and the depth-of-field camera, panoramic shooting and VR (Virtual Reality) shooting functions through fusion of the main camera and the wide-angle camera, or other fused shooting functions. In some embodiments, the camera assembly 1006 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash refers to a combination of a warm-light flash and a cold-light flash, and can be used for light compensation under different color temperatures.
The audio circuit 1007 may include a microphone and a speaker. The microphone is used to collect sound waves of the user and the environment, convert the sound waves into electrical signals, and input them to the processor 1001 for processing, or input them to the radio frequency circuit 1004 to realize voice communication. For the purpose of stereo collection or noise reduction, there may be multiple microphones, arranged at different parts of the device 1000. The microphone may also be an array microphone or an omnidirectional collection microphone. The speaker is used to convert electrical signals from the processor 1001 or the radio frequency circuit 1004 into sound waves. The speaker may be a traditional thin-film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can not only convert electrical signals into sound waves audible to humans, but also convert electrical signals into sound waves inaudible to humans for purposes such as ranging. In some embodiments, the audio circuit 1007 may also include a headphone jack.
The positioning component 1008 is used to locate the current geographic position of the device 1000 to implement navigation or LBS (Location Based Service). The positioning component 1008 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, or the GLONASS system of Russia.
The power supply 1009 is used to supply power to the various components in the device 1000. The power supply 1009 may be an alternating current supply, a direct current supply, a disposable battery, or a rechargeable battery. When the power supply 1009 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. A wired rechargeable battery is a battery charged through a wired line, and a wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charging technology.
In some embodiments, the device 1000 further includes one or more sensors 1010. The one or more sensors 1010 include but are not limited to: an acceleration sensor 1011, a gyroscope sensor 1012, a pressure sensor 1013, a fingerprint sensor 1014, an optical sensor 1015, and a proximity sensor 1016.
The acceleration sensor 1011 can detect the magnitude of acceleration along the three coordinate axes of the coordinate system established with the device 1000. For example, the acceleration sensor 1011 may be used to detect the components of the gravitational acceleration along the three coordinate axes. The processor 1001 may, according to the gravitational acceleration signal collected by the acceleration sensor 1011, control the touch display screen 1005 to display the user interface in a landscape view or a portrait view. The acceleration sensor 1011 may also be used to collect motion data of a game or of the user.
The gyroscope sensor 1012 can detect the body orientation and rotation angle of the device 1000, and may cooperate with the acceleration sensor 1011 to capture the user's 3D actions on the device 1000. According to the data collected by the gyroscope sensor 1012, the processor 1001 can implement the following functions: motion sensing (for example, changing the UI according to the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 1013 may be arranged on a side frame of the device 1000 and/or a lower layer of the touch display screen 1005. When the pressure sensor 1013 is arranged on the side frame of the device 1000, the user's grip signal on the device 1000 can be detected, and the processor 1001 performs left/right-hand recognition or shortcut operations according to the grip signal collected by the pressure sensor 1013. When the pressure sensor 1013 is arranged at the lower layer of the touch display screen 1005, the processor 1001 controls operable controls on the UI according to the user's pressure operation on the touch display screen 1005. The operable controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 1014 is used to collect the user's fingerprint. The processor 1001 identifies the user's identity according to the fingerprint collected by the fingerprint sensor 1014, or the fingerprint sensor 1014 identifies the user's identity according to the collected fingerprint. When the user's identity is identified as a trusted identity, the processor 1001 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and the like. The fingerprint sensor 1014 may be arranged on the front, back, or side of the device 1000. When a physical button or a manufacturer logo is provided on the device 1000, the fingerprint sensor 1014 may be integrated with the physical button or the manufacturer logo.
The optical sensor 1015 is used to collect ambient light intensity. In one embodiment, the processor 1001 may control the display brightness of the touch display screen 1005 according to the ambient light intensity collected by the optical sensor 1015. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 1005 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 1005 is decreased. In another embodiment, the processor 1001 may also dynamically adjust the shooting parameters of the camera assembly 1006 according to the ambient light intensity collected by the optical sensor 1015.
The proximity sensor 1016, also referred to as a distance sensor, is generally arranged on the front panel of the device 1000. The proximity sensor 1016 is used to collect the distance between the user and the front of the device 1000. In one embodiment, when the proximity sensor 1016 detects that the distance between the user and the front of the device 1000 gradually decreases, the processor 1001 controls the touch display screen 1005 to switch from a screen-on state to a screen-off state; when the proximity sensor 1016 detects that the distance between the user and the front of the device 1000 gradually increases, the processor 1001 controls the touch display screen 1005 to switch from the screen-off state to the screen-on state.
It will be understood by those skilled in the art that the structure shown in Figure 10 does not constitute a limitation of the device 1000, which may include more or fewer components than illustrated, combine certain components, or adopt a different arrangement of components.
Those skilled in the art, after considering the specification and practicing the invention disclosed herein, will readily conceive of other embodiments of the present invention. The present disclosure is intended to cover any variations, uses, or adaptations of the invention that follow the general principles of the invention and include common knowledge or conventional techniques in the art not disclosed in this disclosure. The description and examples are to be regarded as illustrative only, and the true scope and spirit of the invention are indicated by the following claims.
It should be understood that the present invention is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present invention is limited only by the appended claims.

Claims (10)

1. A speech recognition method, characterized by comprising:
obtaining an audio frame to be recognized;
separately extracting a Mel-scale filter bank feature of the audio frame and a speaker information vector;
performing fusion processing on the Mel-scale filter bank feature and the speaker information vector to obtain a fusion feature;
processing the fusion feature based on a target acoustic model to obtain a speech recognition result of the audio frame, the target acoustic model comprising a plurality of dilated convolutional layers.
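By way of illustration of claim 1 only, the following Python sketch (assuming PyTorch; the extractor functions, names, and dimensions are hypothetical placeholders, not the patented implementation) shows the overall shape of the claimed method — extract the two features for an audio frame, fuse them, and pass the fusion feature through the acoustic model:

import torch

def extract_fbank(frame: torch.Tensor, n_mels: int = 40) -> torch.Tensor:
    # Hypothetical placeholder: a real extractor would apply an STFT and
    # a Mel-scale filter bank to the frame.
    return torch.randn(n_mels)

def extract_speaker_vector(frame: torch.Tensor, dim: int = 100) -> torch.Tensor:
    # Hypothetical placeholder: e.g. an i-vector/d-vector extractor that
    # characterizes the speaker and the channel.
    return torch.randn(dim)

def recognize_frame(frame: torch.Tensor, acoustic_model) -> torch.Tensor:
    fbank = extract_fbank(frame)               # Mel-scale filter bank feature
    spk = extract_speaker_vector(frame)        # speaker information vector
    fused = torch.cat([fbank, spk], dim=-1)    # one simple fusion choice
    return acoustic_model(fused.unsqueeze(0))  # recognition scores for the frame

Claim 2 refines the fusion step; plain concatenation is used above only to keep the sketch short.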
2. The speech recognition method according to claim 1, wherein the performing fusion processing on the Mel-scale filter bank feature and the speaker information vector comprises:
performing standardization processing on the Mel-scale filter bank feature to obtain a first intermediate feature;
performing dimension conversion processing on the speaker information vector to obtain a second intermediate feature, a dimension of the second intermediate feature being greater than a dimension of the speaker information vector;
performing standardization processing on the second intermediate feature to obtain a third intermediate feature;
performing fusion processing on the first intermediate feature and the third intermediate feature to obtain the fusion feature.
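A minimal sketch of the fusion front end of claim 2, again assuming PyTorch (the module name, the dimensions, and the choice of concatenation as the final fusion operation are assumptions made for illustration):

import torch
import torch.nn as nn

class FusionFrontEnd(nn.Module):
    # Standardize the filter bank feature, expand the speaker vector to a
    # higher dimension, standardize the expansion, then fuse the two.
    def __init__(self, fbank_dim: int = 40, spk_dim: int = 100, spk_proj_dim: int = 256):
        super().__init__()
        self.bn_fbank = nn.BatchNorm1d(fbank_dim)       # first standardization
        self.expand = nn.Linear(spk_dim, spk_proj_dim)  # dimension conversion, spk_proj_dim > spk_dim
        self.bn_spk = nn.BatchNorm1d(spk_proj_dim)      # second standardization

    def forward(self, fbank: torch.Tensor, spk: torch.Tensor) -> torch.Tensor:
        first = self.bn_fbank(fbank)   # first intermediate feature
        second = self.expand(spk)      # second intermediate feature
        third = self.bn_spk(second)    # third intermediate feature
        return torch.cat([first, third], dim=-1)  # fusion feature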
3. The speech recognition method according to claim 2, wherein the performing standardization processing on the Mel-scale filter bank feature to obtain a first intermediate feature comprises:
standardizing, based on a first BatchNorm layer, the Mel-scale filter bank feature to a mean of 0 and a variance of 1 to obtain the first intermediate feature;
and the performing standardization processing on the second intermediate feature to obtain a third intermediate feature comprises:
standardizing, based on a second BatchNorm layer, the second intermediate feature to a mean of 0 and a variance of 1 to obtain the third intermediate feature.
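The computation named in claim 3 can be written out directly. The function below reproduces the core of what a BatchNorm layer computes over a batch, shifting and scaling each feature dimension to mean 0 and variance 1 (a full BatchNorm layer would additionally apply learned affine parameters and track running statistics, omitted here for clarity):

import torch

def batch_standardize(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # x: (batch, features); standardize each feature dimension across the batch.
    mean = x.mean(dim=0, keepdim=True)
    var = x.var(dim=0, unbiased=False, keepdim=True)
    return (x - mean) / torch.sqrt(var + eps)  # mean 0, variance 1 per dimension

x = torch.randn(8, 40) * 3.0 + 5.0
y = batch_standardize(x)
print(y.mean(dim=0).abs().max())            # close to 0
print(y.var(dim=0, unbiased=False).mean())  # close to 1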
4. The speech recognition method according to claim 1, wherein the target acoustic model comprises a dilated convolutional neural network and a long short-term memory (LSTM) network, the dilated convolutional neural network comprises the plurality of dilated convolutional layers, and the LSTM network comprises a plurality of LSTM layers;
the processing the fusion feature based on a target acoustic model to obtain a speech recognition result of the audio frame comprises:
inputting the fusion feature into the dilated convolutional neural network, and processing the fusion feature sequentially through the plurality of dilated convolutional layers, wherein an output of a preceding dilated convolutional layer is an input of a following dilated convolutional layer;
taking a first output result of the last dilated convolutional layer as an input of the LSTM network, and processing the first output result sequentially through the plurality of LSTM layers, wherein an output of a preceding LSTM layer is an input of a following LSTM layer;
determining the speech recognition result based on a second output result of the last LSTM layer.
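An illustrative PyTorch model with the structure of claim 4 is sketched below; the layer counts, channel sizes, and output dimension are assumptions, not values taken from the disclosure. Doubling the dilation from layer to layer is one common way to enlarge the receptive field without increasing per-layer computation, which matches the motivation given for using dilated convolutions:

import torch
import torch.nn as nn

class DilatedConvLSTMModel(nn.Module):
    # Stacked dilated 1-D convolutions followed by stacked LSTM layers.
    def __init__(self, in_dim: int = 296, hidden: int = 256,
                 n_conv: int = 3, n_lstm: int = 2, n_out: int = 1000):
        super().__init__()
        convs, ch = [], in_dim
        for i in range(n_conv):
            convs.append(nn.Conv1d(ch, hidden, kernel_size=3,
                                   dilation=2 ** i, padding=2 ** i))
            convs.append(nn.ReLU())
            ch = hidden
        self.convs = nn.Sequential(*convs)   # each conv layer feeds the next
        self.lstm = nn.LSTM(hidden, hidden, num_layers=n_lstm, batch_first=True)
        self.out = nn.Linear(hidden, n_out)  # e.g. per-frame state posteriors

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # fused: (batch, time, in_dim)
        h = self.convs(fused.transpose(1, 2))  # convolve along the time axis
        h, _ = self.lstm(h.transpose(1, 2))    # "first output result" enters the LSTM stack
        return self.out(h)                     # scores derived from the "second output result"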
5. The speech recognition method according to claim 2, wherein the performing fusion processing on the first intermediate feature and the third intermediate feature to obtain the fusion feature comprises:
performing column exchange processing on the first intermediate feature and the third intermediate feature to obtain the fusion feature; or,
performing weighted conversion processing on the first intermediate feature and the third intermediate feature based on a weight matrix to obtain the fusion feature.
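Both alternatives of claim 5 can be sketched in a few lines. Note that the interleaving reading of "column exchange" and the two weight matrices below are assumptions made for illustration; the claim itself does not fix these details:

import torch

def fuse_by_column_exchange(first: torch.Tensor, third: torch.Tensor) -> torch.Tensor:
    # One possible reading: interleave the columns of the two intermediate
    # features (assumes they have the same number of columns).
    fused = torch.empty(first.shape[0], first.shape[1] + third.shape[1])
    fused[:, 0::2] = first
    fused[:, 1::2] = third
    return fused

def fuse_by_weight_matrix(first, third, w_first, w_third):
    # Weighted conversion: project each feature with a weight matrix and sum.
    return first @ w_first + third @ w_third

first, third = torch.randn(4, 256), torch.randn(4, 256)
print(fuse_by_column_exchange(first, third).shape)  # torch.Size([4, 512])
print(fuse_by_weight_matrix(first, third,
                            torch.randn(256, 256), torch.randn(256, 256)).shape)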
6. A speech recognition device, characterized by comprising:
an acquisition unit configured to obtain an audio frame to be recognized;
an extraction unit configured to separately extract a Mel-scale filter bank feature of the audio frame and a speaker information vector;
a fusion unit configured to perform fusion processing on the Mel-scale filter bank feature and the speaker information vector to obtain a fusion feature;
a processing unit configured to process the fusion feature based on a target acoustic model to obtain a speech recognition result of the audio frame, the target acoustic model comprising a plurality of dilated convolutional layers.
7. The speech recognition device according to claim 6, wherein the fusion unit comprises:
a first processing subunit configured to perform standardization processing on the Mel-scale filter bank feature to obtain a first intermediate feature;
a second processing subunit configured to perform dimension conversion processing on the speaker information vector to obtain a second intermediate feature, a dimension of the second intermediate feature being greater than a dimension of the speaker information vector;
a third processing subunit configured to perform standardization processing on the second intermediate feature to obtain a third intermediate feature;
a fusion subunit configured to perform fusion processing on the first intermediate feature and the third intermediate feature to obtain the fusion feature.
8. The speech recognition device according to claim 6, wherein the target acoustic model comprises a dilated convolutional neural network and a long short-term memory (LSTM) network, the dilated convolutional neural network comprises the plurality of dilated convolutional layers, and the LSTM network comprises a plurality of LSTM layers;
the processing unit is further configured to input the fusion feature into the dilated convolutional neural network and process the fusion feature sequentially through the plurality of dilated convolutional layers, wherein an output of a preceding dilated convolutional layer is an input of a following dilated convolutional layer; take a first output result of the last dilated convolutional layer as an input of the LSTM network and process the first output result sequentially through the plurality of LSTM layers, wherein an output of a preceding LSTM layer is an input of a following LSTM layer; and determine the speech recognition result based on a second output result of the last LSTM layer.
9. A speech recognition device, characterized by comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to perform the speech recognition method according to any one of claims 1 to 5.
10. A non-transitory computer-readable storage medium, wherein when instructions in the storage medium are executed by a processor of a speech recognition device, the speech recognition device is enabled to perform the speech recognition method according to any one of claims 1 to 5.