CN110047468A - Speech recognition method, device and storage medium - Google Patents
Speech recognition method, device and storage medium
- Publication number
- CN110047468A CN110047468A CN201910418620.XA CN201910418620A CN110047468A CN 110047468 A CN110047468 A CN 110047468A CN 201910418620 A CN201910418620 A CN 201910418620A CN 110047468 A CN110047468 A CN 110047468A
- Authority
- CN
- China
- Prior art keywords
- feature
- intermediate features
- speech recognition
- fusion
- obtains
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
Abstract
The present disclosure relates to a speech recognition method, device and storage medium, belonging to the field of machine learning. The method includes: obtaining an audio frame to be recognized; separately extracting a Mel-scale filter bank feature and a speaker information vector of the audio frame; fusing the Mel-scale filter bank feature and the speaker information vector to obtain a fused feature; and processing the fused feature with a target acoustic model, which includes multiple dilated convolutional layers, to obtain a speech recognition result for the audio frame. Because the disclosure extracts both the Mel-scale filter bank feature and the speaker information vector of an audio frame, fuses the two, and inputs the fused feature into the acoustic model, and because the fused feature effectively represents speaker and channel characteristics, the accuracy of speech recognition is improved. In addition, the dilated convolutional layers in the acoustic model reduce the amount of computation for the same receptive field, which speeds up speech recognition.
Description
Technical field
The present disclosure relates to the field of machine learning, and in particular to a speech recognition method, device and storage medium.
Background
Speech recognition, also known as automatic speech recognition (ASR), is a technology that enables a machine to convert a speech signal into corresponding text or commands through recognition and understanding. Speech recognition technology is now widely used in many fields, such as industry, household appliances, communications, automotive electronics, medical care, home services and consumer electronics.
In the speech recognition process, accuracy and speed are critical: the higher the recognition accuracy and the faster the recognition speed, the higher the user's satisfaction. How to perform speech recognition accurately and quickly, so as to improve the recognition effect, has therefore become an urgent problem for those skilled in the art.
Summary of the invention
The present disclosure provides a speech recognition method, device and storage medium, which can effectively improve the speech recognition effect.
According to a first aspect of the embodiments of the present disclosure, a speech recognition method is provided, comprising:
obtaining an audio frame to be recognized;
separately extracting a Mel-scale filter bank feature and a speaker information vector of the audio frame;
fusing the Mel-scale filter bank feature and the speaker information vector to obtain a fused feature; and
processing the fused feature with a target acoustic model to obtain a speech recognition result for the audio frame, wherein the target acoustic model includes multiple dilated convolutional layers.
In one possible implementation, fusing the Mel-scale filter bank feature and the speaker information vector comprises:
normalizing the Mel-scale filter bank feature to obtain a first intermediate feature;
performing a dimension conversion on the speaker information vector to obtain a second intermediate feature, wherein the dimension of the second intermediate feature is greater than the dimension of the speaker information vector;
normalizing the second intermediate feature to obtain a third intermediate feature; and
fusing the first intermediate feature and the third intermediate feature to obtain the fused feature.
In one possible implementation, normalizing the Mel-scale filter bank feature to obtain the first intermediate feature comprises:
normalizing the Mel-scale filter bank feature to zero mean and unit variance with a first BatchNorm (batch normalization) layer to obtain the first intermediate feature;
and normalizing the second intermediate feature to obtain the third intermediate feature comprises:
normalizing the second intermediate feature to zero mean and unit variance with a second BatchNorm layer to obtain the third intermediate feature.
In one possible implementation, the target acoustic model includes a dilated convolutional neural network and an LSTM (Long Short-Term Memory) network, wherein the dilated convolutional neural network includes the multiple dilated convolutional layers and the LSTM network includes multiple LSTM layers;
and processing the fused feature with the target acoustic model to obtain the speech recognition result for the audio frame comprises:
inputting the fused feature into the dilated convolutional neural network and processing it successively through the multiple dilated convolutional layers, wherein the output of each dilated convolutional layer is the input of the next dilated convolutional layer;
taking a first output result of the last dilated convolutional layer as the input of the LSTM network and processing it successively through the multiple LSTM layers, wherein the output of each LSTM layer is the input of the next LSTM layer; and
determining the speech recognition result based on a second output result of the last LSTM layer.
In one possible implementation, fusing the first intermediate feature and the third intermediate feature to obtain the fused feature comprises:
performing a column-exchange operation on the first intermediate feature and the third intermediate feature to obtain the fused feature; or
performing a weighted transformation on the first intermediate feature and the third intermediate feature based on a weight matrix to obtain the fused feature.
According to a second aspect of the embodiments of the present disclosure, a speech recognition device is provided, comprising:
an acquiring unit configured to obtain an audio frame to be recognized;
an extraction unit configured to separately extract a Mel-scale filter bank feature and a speaker information vector of the audio frame;
a fusion unit configured to fuse the Mel-scale filter bank feature and the speaker information vector to obtain a fused feature; and
a processing unit configured to process the fused feature with a target acoustic model to obtain a speech recognition result for the audio frame, wherein the target acoustic model includes multiple dilated convolutional layers.
In one possible implementation, the fusion unit comprises:
a first processing subunit configured to normalize the Mel-scale filter bank feature to obtain a first intermediate feature;
a second processing subunit configured to perform a dimension conversion on the speaker information vector to obtain a second intermediate feature, wherein the dimension of the second intermediate feature is greater than the dimension of the speaker information vector;
a third processing subunit configured to normalize the second intermediate feature to obtain a third intermediate feature; and
a fusion subunit configured to fuse the first intermediate feature and the third intermediate feature to obtain the fused feature.
In one possible implementation, the first processing subunit is further configured to normalize the Mel-scale filter bank feature to zero mean and unit variance with a first BatchNorm layer to obtain the first intermediate feature; and the third processing subunit is configured to normalize the second intermediate feature to zero mean and unit variance with a second BatchNorm layer to obtain the third intermediate feature.
In one possible implementation, the target acoustic model includes a dilated convolutional neural network and an LSTM network, wherein the dilated convolutional neural network includes the multiple dilated convolutional layers and the LSTM network includes multiple LSTM layers;
and the processing unit is further configured to input the fused feature into the dilated convolutional neural network and process it successively through the multiple dilated convolutional layers, wherein the output of each dilated convolutional layer is the input of the next dilated convolutional layer; take a first output result of the last dilated convolutional layer as the input of the LSTM network and process it successively through the multiple LSTM layers, wherein the output of each LSTM layer is the input of the next LSTM layer; and determine the speech recognition result based on a second output result of the last LSTM layer.
In one possible implementation, the fusion subunit is further configured to perform a column-exchange operation on the first intermediate feature and the third intermediate feature to obtain the fused feature; or to perform a weighted transformation on the first intermediate feature and the third intermediate feature based on a weight matrix to obtain the fused feature.
According to a third aspect of the embodiments of the present disclosure, a speech recognition device is provided, comprising:
a processor; and
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the speech recognition method described in the first aspect above.
According to a fourth aspect of the embodiments of the present disclosure, a non-transitory computer-readable storage medium is provided. When the instructions in the storage medium are executed by a processor of a speech recognition device, the speech recognition device is enabled to perform the speech recognition method described in the first aspect above.
According to a fifth aspect of the embodiments of the present disclosure, an application program is provided. When the instructions in the application program are executed by a processor of a speech recognition device, the speech recognition device is enabled to perform the speech recognition method described in the first aspect above.
The technical solutions provided by the embodiments of the present disclosure can include the following beneficial effects:
In the speech recognition process, the embodiments of the present disclosure extract both the Mel-scale filter bank feature and the speaker information vector of an audio frame, fuse the two features, and input the fused feature into an acoustic model as the acoustic feature for speech recognition. Because the fused feature effectively represents speaker and channel characteristics, this recognition scheme improves the accuracy of speech recognition. In addition, the acoustic model includes multiple dilated convolutional layers; by exploiting the dilation of these layers, the amount of computation can be reduced for the same receptive field, which speeds up recognition. The speech recognition method provided by the embodiments of the present disclosure can therefore effectively improve the speech recognition effect.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present disclosure.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and, together with the description, serve to explain the principles of the invention.
Fig. 1 is a schematic diagram of dilated convolution according to an exemplary embodiment.
Fig. 2 is a schematic structural diagram of an implementation environment involved in a speech recognition method according to an exemplary embodiment.
Fig. 3 is a flowchart of a speech recognition method according to an exemplary embodiment.
Fig. 4 is a flowchart of a speech recognition method according to an exemplary embodiment.
Fig. 5 is a schematic diagram of a multi-feature fusion process according to an exemplary embodiment.
Fig. 6 is a schematic structural diagram of an acoustic model according to an exemplary embodiment.
Fig. 7 is a block diagram of a speech recognition device according to an exemplary embodiment.
Fig. 8 is a block diagram of a speech recognition device according to an exemplary embodiment.
Fig. 9 is a block diagram of a speech recognition device according to an exemplary embodiment.
Fig. 10 is a block diagram of a speech recognition device according to an exemplary embodiment.
Detailed description
Exemplary embodiments are described in detail here, with examples illustrated in the accompanying drawings. In the following description, when the accompanying drawings are referred to, the same numbers in different drawings indicate the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of devices and methods consistent with some aspects of the invention as detailed in the appended claims.
Before the embodiments of the present disclosure are explained in detail, some terms involved in the embodiments of the present disclosure are first explained.
Mel-scale filter bank feature: in the embodiments of the present disclosure, the Mel-scale filter bank feature refers to the FilterBank feature, also called the FBank feature. The FilterBank algorithm is a front-end processing algorithm that processes audio in a manner similar to the human ear, and it can improve speech recognition performance.
Speaker information vector: in the embodiments of the present disclosure, the speaker information vector refers to the i-vector feature. The i-vector feature captures both speaker variability and channel variability; in other words, it effectively represents speaker and channel characteristics, that is, it is used to characterize the speaker and the channel.
Dilated convolution (also known as atrous or expanded convolution): a convolution that enlarges the receptive field. In a convolutional neural network, the region of the input layer that determines one element of the output of a certain layer is called the receptive field of that element. Expressed mathematically, the receptive field of an element of a layer's output is a mapping of that element back onto the input layer.
Referring to Fig. 1, dilated convolution operations with dilation rates equal to 1, 2 and 3 are illustrated. The left diagram of Fig. 1 corresponds to a 1-dilated convolution with a 3x3 kernel, which is identical to an ordinary convolution. The middle diagram of Fig. 1 corresponds to a 2-dilated convolution with a 3x3 kernel: the actual kernel size is still 3x3, but with holes of size 1 between the sampled positions. That is, for an input region of size 7x7, only the 9 positions marked as black squares are convolved with the 3x3 kernel, and the rest are skipped. Equivalently, the kernel can be viewed as a 7x7 kernel in which only the 9 weights at the black squares are non-zero and all other weights are zero. As the middle diagram shows, although the kernel size is only 3x3, the receptive field of this convolution has grown to 7x7. The right diagram of Fig. 1 corresponds to a 3-dilated convolution with a 3x3 kernel.
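A minimal sketch, in plain Python (illustrative only, not part of the patent), of how the receptive field grows when dilated 3x3 convolutions are stacked as in Fig. 1:

```python
def receptive_field(dilations, kernel=3):
    """Receptive field (per side) of a stack of dilated convolutions with
    stride 1: each layer adds dilation * (kernel - 1) to the span."""
    rf = 1
    for d in dilations:
        rf += d * (kernel - 1)
    return rf

print(receptive_field([1]))        # 3: ordinary 3x3 convolution
print(receptive_field([1, 2]))     # 7: the 7x7 receptive field discussed above
print(receptive_field([1, 2, 4]))  # 15: the receptive field keeps growing
```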
Batch normalization layer: in the embodiments of the present disclosure, the batch normalization layer refers to a BatchNorm layer. The role of a BatchNorm layer is to transform, by means of normalization, the distribution of the input data into a standard normal distribution with a mean of 0 and a variance of 1.
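A minimal NumPy sketch of this standardization (an illustrative assumption about the exact computation; the learnable scale and shift of a real BatchNorm layer are omitted):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Standardize each feature dimension of a batch to mean 0 and variance 1."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

feats = 5.0 * np.random.randn(32, 40) + 3.0  # e.g. a batch of 40-dim FBank features
normed = batch_norm(feats)
print(normed.mean(axis=0).round(3))  # ~0 per dimension
print(normed.var(axis=0).round(3))   # ~1 per dimension
```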
The implementation environment involved in the speech recognition method provided by the embodiments of the present disclosure is introduced below.
The speech recognition method provided by the embodiments of the present disclosure is applied to a speech recognition apparatus. Referring to Fig. 2, the speech recognition apparatus 201 is a computer device with machine learning capability. For example, the computer device may be a fixed computer device such as a PC or a server, or a mobile computer device such as a tablet computer or a smartphone; the embodiments of the present disclosure do not specifically limit this. The speech recognition apparatus 201 includes a feature extraction module for front-end processing and an acoustic model for back-end processing.
As noted, the accuracy and speed of speech recognition are critical. In the related art, for front-end processing, an MFCC (Mel Frequency Cepstral Coefficient) feature is usually extracted, but the MFCC feature cannot effectively characterize speaker differences and channel differences. For back-end processing, the related art uses a CNN (Convolutional Neural Network) as the acoustic model for recognition. A CNN is a feed-forward neural network whose convolution kernels unfold along the time dimension, that is, they depend on both historical and future audio frames; the more left and right audio frames are depended on, the larger the receptive field of the network. When multiple convolutional layers are stacked, more left and right audio frames are needed, the amount of computation becomes very large, and the speech recognition speed is reduced accordingly.
To this end, the embodiments of the present disclosure provide a speech recognition method. On the one hand, this speech recognition method performs multi-feature fusion: during feature extraction, the FBank feature and the i-vector feature are extracted simultaneously, and the fusion of the FBank feature and the i-vector feature serves as the input for speech recognition, which improves recognition accuracy.
On the other hand, this speech recognition method uses a new deep neural network as the acoustic model. The acoustic model includes a dilated convolutional neural network and an LSTM network, that is, Dilated-CNN+LSTM. Compared with a CNN, fewer convolution kernels are needed for the same receptive field, which effectively reduces the amount of computation and increases the speech recognition speed.
In addition, for the same amount of computation, a dilated convolution has a larger receptive field than a CNN and can therefore capture more information, and a larger dilation rate likewise yields a larger receptive field. From this angle, setting recognition speed aside, using dilated convolution can also improve recognition accuracy.
Fig. 3 is a flowchart of a speech recognition method according to an exemplary embodiment. As shown in Fig. 3, the method is used in the speech recognition apparatus shown in Fig. 2 and includes the following steps.
In step 301, an audio frame to be recognized is obtained.
In step 302, a Mel-scale filter bank feature and a speaker information vector of the audio frame are extracted separately.
In step 303, the Mel-scale filter bank feature and the speaker information vector are fused to obtain a fused feature.
In step 304, the fused feature is processed with a target acoustic model to obtain a speech recognition result for the audio frame, wherein the target acoustic model includes multiple dilated convolutional layers.
In the method provided by the embodiments of the present disclosure, during speech recognition the Mel-scale filter bank feature and the speaker information vector of an audio frame are extracted simultaneously, the two features are fused, and the fused feature is input into the acoustic model as the acoustic feature for speech recognition. Because the fused feature effectively represents speaker and channel characteristics, this recognition scheme improves recognition accuracy. In addition, the acoustic model includes multiple dilated convolutional layers, which reduce the amount of computation for the same receptive field and thereby speed up recognition. The speech recognition method provided by the embodiments of the present disclosure can therefore effectively improve the speech recognition effect.
In one possible implementation, fusing the Mel-scale filter bank feature and the speaker information vector comprises:
normalizing the Mel-scale filter bank feature to obtain a first intermediate feature;
performing a dimension conversion on the speaker information vector to obtain a second intermediate feature, wherein the dimension of the second intermediate feature is greater than the dimension of the speaker information vector;
normalizing the second intermediate feature to obtain a third intermediate feature; and
fusing the first intermediate feature and the third intermediate feature to obtain the fused feature.
In one possible implementation, normalizing the Mel-scale filter bank feature to obtain the first intermediate feature comprises: normalizing the Mel-scale filter bank feature to zero mean and unit variance with a first BatchNorm layer to obtain the first intermediate feature; and normalizing the second intermediate feature to obtain the third intermediate feature comprises: normalizing the second intermediate feature to zero mean and unit variance with a second BatchNorm layer to obtain the third intermediate feature.
In one possible implementation, the target acoustic model includes a dilated convolutional neural network and an LSTM network, wherein the dilated convolutional neural network includes the multiple dilated convolutional layers and the LSTM network includes multiple LSTM layers; and processing the fused feature with the target acoustic model to obtain the speech recognition result for the audio frame comprises:
inputting the fused feature into the dilated convolutional neural network and processing it successively through the multiple dilated convolutional layers, wherein the output of each dilated convolutional layer is the input of the next dilated convolutional layer;
taking a first output result of the last dilated convolutional layer as the input of the LSTM network and processing it successively through the multiple LSTM layers, wherein the output of each LSTM layer is the input of the next LSTM layer; and
determining the speech recognition result based on a second output result of the last LSTM layer.
In one possible implementation, fusing the first intermediate feature and the third intermediate feature to obtain the fused feature comprises:
performing a column-exchange operation on the first intermediate feature and the third intermediate feature to obtain the fused feature; or
performing a weighted transformation on the first intermediate feature and the third intermediate feature based on a weight matrix to obtain the fused feature.
All of the above optional technical solutions may be combined in any manner to form optional embodiments of the present disclosure, which will not be described one by one here.
It should be noted that descriptions such as "first", "second", "third" and "fourth" appearing in the following embodiments are only used to distinguish different objects and do not constitute any other specific limitation on those objects.
Fig. 4 is a flowchart of a speech recognition method according to an exemplary embodiment. As shown in Fig. 4, the method is used in the speech recognition apparatus shown in Fig. 2 and includes the following steps.
In step 401, an audio frame to be recognized is obtained.
An audio frame generally refers to a short segment of audio of fixed length. As an example, in speech recognition the frame length is usually set to 10 to 30 ms (milliseconds), i.e., the playing duration of one audio frame is 10 to 30 ms; within such a frame there are enough periods of the signal, and the signal does not change too sharply. In the embodiments of the present disclosure, the playing duration of an audio frame is 25 ms, i.e., the frame length is 25 ms and the frame shift is 10 ms.
In one possible implementation, before feature extraction the speech recognition apparatus usually preprocesses the speaker's speech, where the preprocessing includes but is not limited to framing, pre-emphasis, windowing, noise reduction and so on.
In addition, the speaker's speech may be collected by a voice acquisition device configured in the speech recognition apparatus, or may be sent to the speech recognition apparatus by another device; the embodiments of the present disclosure do not specifically limit this.
As an example, the embodiments of the present disclosure may perform speech recognition on one piece of audio frame by frame, or may perform speech recognition on multiple pieces of audio; the embodiments of the present disclosure likewise do not specifically limit this.
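A minimal NumPy sketch of the 25 ms frame length / 10 ms frame shift described above (the 16 kHz sample rate is an illustrative assumption):

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_len_ms=25, frame_shift_ms=10):
    """Split a 1-D waveform into overlapping frames (25 ms long, 10 ms apart)."""
    frame_len = sample_rate * frame_len_ms // 1000       # 400 samples at 16 kHz
    frame_shift = sample_rate * frame_shift_ms // 1000   # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    return np.stack([signal[i * frame_shift: i * frame_shift + frame_len]
                     for i in range(n_frames)])

wave = np.random.randn(16000)    # one second of audio at the assumed 16 kHz
print(frame_signal(wave).shape)  # (98, 400)
```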
In step 402, the FBank feature and the i-vector feature of the audio frame to be recognized are extracted separately.
Extracting the FBank feature
FBank feature extraction is performed after preprocessing; at that point the speaker's speech has already been divided into individual audio frames, and the FBank feature is extracted frame by frame. Because the signal obtained after framing is still a time-domain signal, while the FBank feature is extracted in the frequency domain, the time-domain signal must first be converted into a frequency-domain signal.
In one possible implementation, the signal can be transformed from the time domain to the frequency domain by a Fourier transform. The Fourier transform can further be divided into the continuous Fourier transform and the discrete Fourier transform; because the audio frame is digital rather than analog audio, the embodiments of the present disclosure use the discrete Fourier transform to extract the FBank feature. As an example, the FFT (Fast Fourier Transform) is usually used to perform FBank feature extraction frame by frame.
In one possible implementation, for frame-by-frame speech recognition, the dimension of the FBank feature may be 40; the embodiments of the present disclosure do not specifically limit this.
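A sketch of 40-dimensional FBank extraction via FFT and a mel filter bank; using librosa here is an assumption for illustration, since the patent does not name a library:

```python
import numpy as np
import librosa

def fbank_features(wave, sr=16000):
    """STFT -> mel filter bank -> log energies: 40-dim FBank per frame,
    with the 25 ms window and 10 ms shift used above."""
    mel_spec = librosa.feature.melspectrogram(
        y=wave, sr=sr,
        n_fft=512, win_length=400, hop_length=160,  # 25 ms window, 10 ms shift
        n_mels=40)
    return np.log(mel_spec + 1e-6).T  # shape (n_frames, 40)

wave = np.random.randn(16000).astype(np.float32)
print(fbank_features(wave).shape)  # (101, 40)
```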
Extracting the i-vector feature
The JFA (Joint Factor Analysis) method uses subspaces of the GMM (Gaussian Mixture Model) supervector space to model speaker variability and channel variability separately, so as to separate out channel interference. However, in the JFA model the channel factors also carry part of the speaker information, and some speaker information is lost during compensation. On this basis, the total variability space model was proposed, which models speaker variability and channel variability together as a whole. This method alleviates JFA's high demands on training corpora and its high computational complexity, while achieving performance comparable to JFA.
Given a segment of speech from a speaker, the corresponding Gaussian mean supervector can be defined as:
M = m + Tw
where M is the Gaussian mean supervector of the given speech; m is the Gaussian mean supervector of the UBM (Universal Background Model), which is independent of the specific speaker and channel; T is the total variability space matrix, which is of low rank; and w is the total variability factor, whose posterior mean is the i-vector feature and which a priori follows a standard normal distribution.
In the above formula, M and m can be computed, while the total variability space matrix T and the total variability factor w are estimated. The total variability space matrix T treats all given speech segments as coming from different speakers, even multiple segments of speech from the same speaker. The i-vector feature is defined as the maximum a posteriori point estimate of the total variability factor w, that is, the posterior mean of w. In one possible implementation, after the total variability space matrix T has been estimated, the zeroth- and first-order Baum-Welch statistics are extracted for the given speaker's speech, from which the estimate of the i-vector feature can be computed.
In one possible implementation, for frame-by-frame speech recognition, the dimension of the i-vector feature may be 100; the embodiments of the present disclosure do not specifically limit this.
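A hedged NumPy sketch of the i-vector point estimate from the Baum-Welch statistics, assuming the total variability matrix T and the UBM are already trained; the diagonal-covariance simplification and all sizes below are illustrative assumptions:

```python
import numpy as np

def ivector(T, sigma_inv_diag, N, F_centered):
    """Posterior mean of the total variability factor w (the i-vector):
    w = (I + T' Sigma^-1 N T)^-1 T' Sigma^-1 F_centered, where N holds the
    zeroth-order stats per UBM component and F_centered the centered
    first-order stats stacked into a supervector."""
    C, R = len(N), T.shape[1]
    D = T.shape[0] // C
    n_expanded = np.repeat(N, D)            # one occupancy count per supervector dim
    TtS = T.T * sigma_inv_diag              # T' Sigma^-1, shape (R, C*D)
    L = np.eye(R) + (TtS * n_expanded) @ T  # posterior precision of w
    return np.linalg.solve(L, TtS @ F_centered)

C, D, R = 64, 40, 100                       # assumed UBM size and i-vector dim
T = 0.01 * np.random.randn(C * D, R)
w = ivector(T, np.ones(C * D), np.random.rand(C), np.random.randn(C * D))
print(w.shape)  # (100,): the 100-dim i-vector
```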
In step 403, the extracted FBank feature and i-vector feature are fused to obtain a fused feature.
In one possible implementation, fusing the extracted FBank feature and i-vector feature includes the following steps.
4031. Normalize the FBank feature to obtain a first intermediate feature.
Referring to Fig. 5, which illustrates the multi-feature fusion process, the FBank feature first passes through a BatchNorm layer for normalization. For ease of distinction, the embodiments of the present disclosure refer to this BatchNorm layer as the first BatchNorm layer.
In one possible implementation, normalizing the FBank feature to obtain the first intermediate feature includes but is not limited to: normalizing the FBank feature to zero mean and unit variance based on the first BatchNorm layer to obtain the first intermediate feature.
As shown in Fig. 5, the dimension of the FBank feature does not change after the BatchNorm layer and is still 40. In addition, for ease of reference, the FBank feature after the BatchNorm layer is referred to here as the first intermediate feature.
4032. Perform a dimension conversion on the i-vector feature to obtain a second intermediate feature.
As shown in Fig. 5, before the i-vector feature carrying speaker and channel characteristics is normalized, it first passes through a linear mapping (linear) layer that performs a dimension conversion on it. In one possible implementation, the dimension conversion is a dimension increase, i.e., the dimension of the second intermediate feature is greater than the dimension of the initial i-vector feature. For example, the i-vector feature is 100-dimensional before the linear layer, and the linear layer maps the 100-dimensional i-vector feature to 200 dimensions.
The i-vector feature after the linear layer is referred to here as the second intermediate feature.
4033. Normalize the second intermediate feature to obtain a third intermediate feature.
Referring to Fig. 5, after the linear layer, the i-vector feature also passes through a BatchNorm layer; for ease of distinction, the embodiments of the present disclosure refer to this BatchNorm layer as the second BatchNorm layer.
Similarly, normalizing the second intermediate feature to obtain the third intermediate feature includes but is not limited to: normalizing the second intermediate feature to zero mean and unit variance based on the second BatchNorm layer to obtain the third intermediate feature.
As shown in Fig. 5, the dimension does not change after the linear-layer output passes through the BatchNorm layer and is still 200. In addition, for ease of reference, the second intermediate feature after the BatchNorm layer is referred to here as the third intermediate feature.
4034. Fuse the first intermediate feature and the third intermediate feature to obtain the fused feature.
Next, referring to Fig. 5, the FBank feature and the i-vector feature, after their respective BatchNorm layers, are input together into a fusion (combine) layer for fusion.
The combine layer may be a linear mapping layer or a column-exchange layer; the embodiments of the present disclosure do not specifically limit this. In one possible implementation, the parameters of the combine layer can be randomly initialized and optimized by a backpropagation algorithm, for example the stochastic gradient descent algorithm; the embodiments of the present disclosure do not specifically limit this.
As an example, fusing the first intermediate feature and the third intermediate feature to obtain the fused feature includes, but is not limited to, the following two approaches:
First, when the combine layer is a column-exchange layer, a column-exchange operation is performed on the first intermediate feature and the third intermediate feature to obtain the fused feature. The dimension of the fused feature is consistent with the dimensions of the first intermediate feature and the third intermediate feature. As shown in Fig. 5, when the dimension of the first intermediate feature is 40 and the third intermediate feature is 200-dimensional, the dimension of the fused feature is 40*6 (i.e., 240). As an example, the column-exchange operation exchanges columns of the features; for instance, the first column of features is exchanged with the second column, or the first column is exchanged with the last column; the embodiments of the present disclosure do not specifically limit this.
Second, when the combine layer is a linear layer, a weighted transformation is performed on the first intermediate feature and the third intermediate feature based on a weight matrix to obtain the fused feature. Fusing through a linear layer is equivalent to multiplying the first intermediate feature and the third intermediate feature by a weight. The weight matrix can be randomly initialized and trained jointly with the acoustic model; the embodiments of the present disclosure do not specifically limit this.
Fusing the FBank feature and the i-vector feature after the two BatchNorm layers as described above enhances the robustness of the features.
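A hedged PyTorch sketch of the fusion front end of Fig. 5; treating the combine layer as a linear layer over the concatenation of the two normalized features is an illustrative assumption:

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """BatchNorm on the 40-dim FBank feature, Linear(100->200) plus BatchNorm on
    the i-vector, then a trainable combine layer producing a 240-dim (40*6)
    fused feature."""
    def __init__(self, fbank_dim=40, ivec_dim=100, ivec_proj=200, fused_dim=240):
        super().__init__()
        self.bn_fbank = nn.BatchNorm1d(fbank_dim)          # first BatchNorm layer
        self.ivec_linear = nn.Linear(ivec_dim, ivec_proj)  # dimension increase
        self.bn_ivec = nn.BatchNorm1d(ivec_proj)           # second BatchNorm layer
        self.combine = nn.Linear(fbank_dim + ivec_proj, fused_dim)  # weight-matrix fusion

    def forward(self, fbank, ivec):
        first = self.bn_fbank(fbank)                    # first intermediate feature
        third = self.bn_ivec(self.ivec_linear(ivec))    # third intermediate feature
        return self.combine(torch.cat([first, third], dim=-1))

fusion = FeatureFusion()
print(fusion(torch.randn(8, 40), torch.randn(8, 100)).shape)  # torch.Size([8, 240])
```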
In step 404, the fused feature is processed with the target acoustic model to obtain the speech recognition result for the audio frame to be recognized, wherein the target acoustic model includes a dilated convolutional neural network and an LSTM network, the dilated convolutional neural network includes multiple dilated convolutional layers, and the LSTM network includes multiple LSTM layers.
In the embodiments of the present disclosure, the output of the combine layer in step 403 above is the input of the target acoustic model.
In one possible implementation, referring to Fig. 6, the dilated convolutional neural network includes 6 dilated convolutional layers in total, referred to as dilated convolutional layer 0 to dilated convolutional layer 5. The dilated convolutions are 2-D convolutions with kernel size M*N, where M is the size of the time-dimension convolution and N is the size of the frequency-dimension convolution. As an example, the kernels of each of the 6 dilated convolutional layers, from layer 0 to layer 5, are as follows:
Layer 0: kernel size 7*3, 64 kernels; layer 1: kernel size 5*3, 64 kernels; layer 2: kernel size 3*3, 128 kernels; layer 3: kernel size 3*3, 128 kernels; layer 4: kernel size 3*3, 256 kernels; layer 5: kernel size 3*3, 256 kernels.
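A hedged PyTorch sketch of the six-layer dilated convolution stack just listed; the patent specifies kernel sizes and kernel counts but not the per-layer dilation rates or padding, so those are illustrative assumptions:

```python
import torch
import torch.nn as nn

def dilated_cnn(in_channels=1):
    """Dilated conv layers 0-5 of Fig. 6: kernels (7x3, 64), (5x3, 64),
    (3x3, 128), (3x3, 128), (3x3, 256), (3x3, 256)."""
    specs = [(64, (7, 3)), (64, (5, 3)), (128, (3, 3)),
             (128, (3, 3)), (256, (3, 3)), (256, (3, 3))]
    layers, prev = [], in_channels
    for i, (out_ch, kernel) in enumerate(specs):
        dilation = (2 ** min(i, 3), 1)  # assumed: dilation grows along time only
        padding = (dilation[0] * (kernel[0] - 1) // 2, (kernel[1] - 1) // 2)
        layers += [nn.Conv2d(prev, out_ch, kernel, dilation=dilation,
                             padding=padding),
                   nn.ReLU()]
        prev = out_ch
    return nn.Sequential(*layers)

cnn = dilated_cnn()
x = torch.randn(1, 1, 50, 240)  # (batch, channel, frames, fused-feature dim)
print(cnn(x).shape)             # torch.Size([1, 256, 50, 240])
```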
The LSTM network is an RNN (Recurrent Neural Network) structure that is currently widely used in acoustic models. Compared with an ordinary RNN, an LSTM controls the storage, input and output of information through carefully designed gates, and it can to some extent avoid the vanishing gradient problem of ordinary RNNs, so that the LSTM network can effectively model long-term correlations in time signals.
In one possible implementation, the LSTM network in the acoustic model generally comprises 3 to 5 LSTM layers, because directly stacking more LSTM layers to build a deeper network tends not to bring performance improvements and can instead make the model worse. As an example, referring to Fig. 6, the LSTM network includes 3 LSTM layers.
The main difference between an LSTM and an ordinary RNN is that the LSTM adds a "processor" structure that judges whether information is useful; this "processor" is called a cell, i.e., the LSTM cell. Three gates are placed in an LSTM cell: an input gate, a forget gate and an output gate.
In one possible implementation, when training the target acoustic model, a dictionary can be used as the training corpus, the acoustic model with the architecture shown in Fig. 6 is used as the initial model, monophones, multi-phone units, letters, words or Chinese characters are used as training targets, and the model is optimized with the stochastic gradient descent algorithm; the embodiments of the present disclosure do not specifically limit this.
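A hedged sketch completing the Dilated-CNN+LSTM acoustic model of Fig. 6 with one stochastic-gradient-descent step, reusing the dilated_cnn helper above; the hidden size, the pooling of the convolutional output and the number of senone classes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    """Six dilated conv layers -> three LSTM layers -> senone posteriors."""
    def __init__(self, lstm_hidden=512, n_senones=4000):
        super().__init__()
        self.cnn = dilated_cnn()                   # sketch from above
        self.reduce = nn.Linear(256, lstm_hidden)  # assumed: average the freq axis
        self.lstm = nn.LSTM(lstm_hidden, lstm_hidden, num_layers=3, batch_first=True)
        self.out = nn.Linear(lstm_hidden, n_senones)

    def forward(self, fused):                      # fused: (batch, frames, 240)
        h = self.cnn(fused.unsqueeze(1))           # (batch, 256, frames, 240)
        h = h.mean(dim=3).transpose(1, 2)          # (batch, frames, 256)
        h, _ = self.lstm(self.reduce(h))           # layer i feeds layer i+1
        return self.out(h)                         # (batch, frames, n_senones)

model = AcousticModel()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(2, 50, 240), torch.randint(0, 4000, (2, 50))
loss = nn.CrossEntropyLoss()(model(x).reshape(-1, 4000), y.reshape(-1))
loss.backward()
opt.step()  # one training update
print(float(loss))
```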
In the embodiments of the present disclosure, processing the fused feature with the target acoustic model to obtain the speech recognition result for the audio frame includes the following steps.
4041. Input the fused feature into the dilated convolutional neural network and process it successively through the multiple dilated convolutional layers, wherein the output of each dilated convolutional layer is the input of the next dilated convolutional layer. As shown in Fig. 6, the dilated convolution operation is performed on the fused feature successively from dilated convolutional layer 0 to dilated convolutional layer 5.
4042. Take the first output result of the last dilated convolutional layer as the input of the LSTM network and process it successively through the multiple LSTM layers, wherein the output of each LSTM layer is the input of the next LSTM layer. Here, to distinguish the outputs of the dilated convolutional neural network and the LSTM network, the output of the dilated convolutional neural network is called the first output result, and the output of the LSTM network is called the second output result. As shown in Fig. 6, the output of the dilated convolutional neural network is processed successively by LSTM layer 0, LSTM layer 1 and LSTM layer 2.
4043. Determine the speech recognition result of the audio to be recognized based on the second output result of the last LSTM layer.
By inputting the fused feature into the multi-layer stack of dilated convolutions, more abstract features can be learned effectively; the first output result is then fed into the multi-layer stack of LSTMs, and finally the output layer outputs the phonetic (acoustic) class corresponding to the current audio frame to be recognized.
The phonetic class may be a multi-phone unit (senone), a phone, a letter, a Chinese character or a word.
The implementation above determines the phonetic class of the acoustic feature extracted from the audio frame based on the acoustic model. In one possible implementation, a language model and decoding techniques can further be used to convert the result into text that the user can understand; the embodiments of the present disclosure do not specifically limit this.
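As a minimal stand-in for that decoding step, a frame-wise greedy argmax over the senone posteriors of the model sketched above; a real system would combine the acoustic scores with a language model in a beam-search or WFST decoder, which the patent leaves open:

```python
import torch

with torch.no_grad():
    posteriors = model(torch.randn(1, 50, 240)).softmax(dim=-1)  # (1, frames, senones)
    senone_ids = posteriors.argmax(dim=-1)                       # greedy frame labels
print(senone_ids[0][:10])  # per-frame senone indices, mapped to phones/words downstream
```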
In the method provided by the embodiments of the present disclosure, the FBank feature and the i-vector feature of an audio frame are extracted simultaneously during speech recognition; the FBank feature and the i-vector feature are then fused, and the fused feature is input into the acoustic model as the acoustic feature for speech recognition. Because the fused feature effectively expresses speaker and channel characteristics, recognition accuracy is improved. In addition, a dilated convolutional neural network is used in the acoustic model; by exploiting the dilation of the dilated convolutional layers, the amount of computation can be effectively reduced for the same receptive field, which, compared with a CNN, speeds up recognition.
In summary, the speech recognition method provided by the embodiments of the present disclosure achieves a good speech recognition effect.
Fig. 7 is a block diagram of a speech recognition device according to an exemplary embodiment. Referring to Fig. 7, the device includes an acquiring unit 701, an extraction unit 702, a fusion unit 703 and a processing unit 704.
The acquiring unit 701 is configured to obtain an audio frame to be recognized;
the extraction unit 702 is configured to separately extract a Mel-scale filter bank feature and a speaker information vector of the audio frame;
the fusion unit 703 is configured to fuse the Mel-scale filter bank feature and the speaker information vector to obtain a fused feature; and
the processing unit 704 is configured to process the fused feature with a target acoustic model to obtain a speech recognition result for the audio frame, wherein the target acoustic model includes multiple dilated convolutional layers.
With the device provided by the embodiments of the present disclosure, the Mel-scale filter bank feature and the speaker information vector of an audio frame are extracted simultaneously during speech recognition, the two features are fused, and the fused feature is input into the acoustic model as the acoustic feature for speech recognition. Because the fused feature effectively represents speaker and channel characteristics, this recognition scheme improves recognition accuracy. In addition, the acoustic model includes multiple dilated convolutional layers, which reduce the amount of computation for the same receptive field and thereby speed up recognition; that is, the speech recognition method provided by the embodiments of the present disclosure can effectively improve the speech recognition effect.
In one possible implementation, referring to Fig. 8, the fusion unit 703 comprises:
a first processing subunit 7031, configured to normalize the Mel-scale filter bank feature to obtain a first intermediate feature;
a second processing subunit 7032, configured to perform a dimension conversion on the speaker information vector to obtain a second intermediate feature, wherein the dimension of the second intermediate feature is greater than the dimension of the speaker information vector;
a third processing subunit 7033, configured to normalize the second intermediate feature to obtain a third intermediate feature; and
a fusion subunit 7034, configured to fuse the first intermediate feature and the third intermediate feature to obtain the fused feature.
In one possible implementation, the first processing subunit 7031 is further configured to normalize the Mel-scale filter bank feature to zero mean and unit variance with a first BatchNorm layer to obtain the first intermediate feature; and the third processing subunit 7033 is configured to normalize the second intermediate feature to zero mean and unit variance with a second BatchNorm layer to obtain the third intermediate feature.
In one possible implementation, the target acoustic model includes a dilated convolutional neural network and an LSTM network, wherein the dilated convolutional neural network includes the multiple dilated convolutional layers and the LSTM network includes multiple LSTM layers;
and the processing unit 704 is further configured to input the fused feature into the dilated convolutional neural network and process it successively through the multiple dilated convolutional layers, wherein the output of each dilated convolutional layer is the input of the next dilated convolutional layer; take a first output result of the last dilated convolutional layer as the input of the LSTM network and process it successively through the multiple LSTM layers, wherein the output of each LSTM layer is the input of the next LSTM layer; and determine the speech recognition result based on a second output result of the last LSTM layer.
In one possible implementation, the fusion subunit 7034 is further configured to perform a column-exchange operation on the first intermediate feature and the third intermediate feature to obtain the fused feature; or to perform a weighted transformation on the first intermediate feature and the third intermediate feature based on a weight matrix to obtain the fused feature.
All of the above optional technical solutions may be combined in any manner to form optional embodiments of the present disclosure, which will not be described one by one here.
Regarding the devices in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the related method and will not be elaborated here.
Fig. 9 is a structural block diagram of a speech recognition device provided by an embodiment of the present disclosure; the device may be a server. The device 900 may vary greatly depending on configuration or performance, and may include one or more processors (central processing units, CPUs) 901 and one or more memories 902, wherein at least one instruction is stored in the memory 902, and the at least one instruction is loaded and executed by the processor 901 to implement the speech recognition method provided by each of the above method embodiments. Of course, the device may also have components such as a wired or wireless network interface, a keyboard and an input/output interface for input and output, and may also include other components for realizing device functions, which will not be described here.
In an exemplary embodiment, a computer-readable storage medium is also provided, for example a memory including instructions, which can be executed by a processor in a terminal to complete the speech recognition method in the above embodiments. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 10 shows a structural block diagram of a device 1000 provided by an exemplary embodiment of the present disclosure. The device 1000 may be a mobile terminal.
Generally, the device 1000 includes a processor 1001 and a memory 1002.
The processor 1001 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 1001 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array) and PLA (Programmable Logic Array). The processor 1001 may also include a main processor and a coprocessor: the main processor is a processor for processing data in the awake state, also called a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 1001 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 1001 may also include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
The memory 1002 may include one or more computer-readable storage media, which may be non-transient. The memory 1002 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash storage devices. In some embodiments, the non-transient computer-readable storage medium in the memory 1002 is used to store at least one instruction, which is executed by the processor 1001 to implement the speech recognition method provided by the method embodiments of the present disclosure.
In some embodiments, the apparatus 1000 optionally further includes a peripheral device interface 1003 and at least one peripheral device. The processor 1001, the memory 1002, and the peripheral device interface 1003 may be connected by a bus or a signal line. Each peripheral device may be connected to the peripheral device interface 1003 by a bus, a signal line, or a circuit board. Specifically, the peripheral devices include at least one of a radio frequency circuit 1004, a touch display screen 1005, a camera 1006, an audio circuit 1007, a positioning component 1008, and a power supply 1009.
The peripheral device interface 1003 may be used to connect at least one I/O (Input/Output)-related peripheral device to the processor 1001 and the memory 1002. In some embodiments, the processor 1001, the memory 1002, and the peripheral device interface 1003 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1001, the memory 1002, and the peripheral device interface 1003 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 1004 is used to receive and transmit RF (Radio Frequency) signals, also referred to as electromagnetic signals. The radio frequency circuit 1004 communicates with communication networks and other communication devices via electromagnetic signals, converting an electrical signal into an electromagnetic signal for transmission or converting a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1004 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and the like. The radio frequency circuit 1004 may communicate with other terminals via at least one wireless communication protocol, including but not limited to the World Wide Web, metropolitan area networks, intranets, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1004 may also include NFC (Near Field Communication)-related circuits, which is not limited by the present disclosure.
The display screen 1005 is used to display a UI (User Interface), which may include graphics, text, icons, video, and any combination thereof. When the display screen 1005 is a touch display screen, it is also able to collect touch signals on or above its surface; such a touch signal may be input to the processor 1001 as a control signal for processing. In this case, the display screen 1005 may further provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 1005, arranged on the front panel of the apparatus 1000; in other embodiments, there may be at least two display screens 1005, arranged on different surfaces of the apparatus 1000 or in a folded design; in still other embodiments, the display screen 1005 may be a flexible display screen arranged on a curved or folded surface of the apparatus 1000. The display screen 1005 may even be given a non-rectangular, irregular shape, i.e., a shaped screen. The display screen 1005 may be made of materials such as an LCD (Liquid Crystal Display) or an OLED (Organic Light-Emitting Diode).
The camera assembly 1006 is used to capture images or video. Optionally, the camera assembly 1006 includes a front camera and a rear camera. In general, the front camera is arranged on the front panel of the terminal and the rear camera on its back. In some embodiments, there are at least two rear cameras, each being one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera can be fused with the depth-of-field camera to realize a background blur function, or with the wide-angle camera to realize panoramic shooting, VR (Virtual Reality) shooting, and other fused shooting functions. In some embodiments, the camera assembly 1006 may also include a flash, which may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash combines a warm-light flash and a cold-light flash and can be used for light compensation at different color temperatures.
The audio circuit 1007 may include a microphone and a speaker. The microphone collects sound waves from the user and the environment and converts them into electrical signals, which are input to the processor 1001 for processing or to the radio frequency circuit 1004 for voice communication. For stereo capture or noise reduction, there may be multiple microphones arranged at different parts of the apparatus 1000; the microphone may also be an array microphone or an omnidirectional microphone. The speaker converts electrical signals from the processor 1001 or the radio frequency circuit 1004 into sound waves. The speaker may be a conventional diaphragm speaker or a piezoelectric ceramic speaker; a piezoelectric ceramic speaker can convert electrical signals not only into sound waves audible to humans but also into sound waves inaudible to humans, for purposes such as ranging. In some embodiments, the audio circuit 1007 may also include a headphone jack.
The positioning component 1008 is used to locate the current geographic position of the apparatus 1000 for navigation or LBS (Location Based Service). The positioning component 1008 may be based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, or Russia's GLONASS system.
The power supply 1009 supplies power to the various components in the apparatus 1000. The power supply 1009 may be alternating current, direct current, a disposable battery, or a rechargeable battery. When the power supply 1009 includes a rechargeable battery, that battery may be a wired charging battery, charged through a wired line, or a wireless charging battery, charged through a wireless coil. The rechargeable battery may also support fast charging technology.
In some embodiments, the apparatus 1000 further includes one or more sensors 1010, including but not limited to an acceleration sensor 1011, a gyroscope sensor 1012, a pressure sensor 1013, a fingerprint sensor 1014, an optical sensor 1015, and a proximity sensor 1016.
The acceleration sensor 1011 can detect the magnitude of acceleration along the three axes of the coordinate system established for the apparatus 1000; for example, it may detect the components of gravitational acceleration along the three axes. Based on the gravitational acceleration signal collected by the acceleration sensor 1011, the processor 1001 may control the touch display screen 1005 to display the user interface in landscape or portrait view. The acceleration sensor 1011 may also be used to collect game or user motion data.
The gyroscope sensor 1012 can detect the body orientation and rotation angle of the apparatus 1000, and may cooperate with the acceleration sensor 1011 to capture the user's 3D motions on the apparatus 1000. Based on the data collected by the gyroscope sensor 1012, the processor 1001 can implement functions such as motion sensing (for example, changing the UI according to the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 1013 may be arranged on the side frame of the apparatus 1000 and/or beneath the touch display screen 1005. When arranged on the side frame, it can detect the user's grip on the apparatus 1000, and the processor 1001 can perform left-hand/right-hand recognition or shortcut operations according to the grip signal collected by the pressure sensor 1013. When arranged beneath the touch display screen 1005, the processor 1001 controls the operable controls on the UI according to the pressure the user applies to the touch display screen 1005. The operable controls include at least one of a button control, a scroll-bar control, an icon control, and a menu control.
The fingerprint sensor 1014 collects the user's fingerprint, and either the processor 1001 identifies the user according to the fingerprint collected by the fingerprint sensor 1014, or the fingerprint sensor 1014 identifies the user according to the collected fingerprint. When the user's identity is recognized as trusted, the processor 1001 authorizes the user to perform related sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings. The fingerprint sensor 1014 may be arranged on the front, back, or side of the apparatus 1000. When a physical button or a manufacturer logo is provided on the apparatus 1000, the fingerprint sensor 1014 may be integrated with it.
The optical sensor 1015 collects the ambient light intensity. In one embodiment, the processor 1001 may control the display brightness of the touch display screen 1005 according to the ambient light intensity collected by the optical sensor 1015: when the ambient light is strong, the display brightness is increased; when it is weak, the display brightness is decreased. In another embodiment, the processor 1001 may also dynamically adjust the shooting parameters of the camera assembly 1006 according to the ambient light intensity collected by the optical sensor 1015.
The proximity sensor 1016, also referred to as a distance sensor, is generally arranged on the front panel of the apparatus 1000 and measures the distance between the user and the front of the apparatus 1000. In one embodiment, when the proximity sensor 1016 detects that this distance is gradually decreasing, the processor 1001 controls the touch display screen 1005 to switch from the screen-on state to the screen-off state; when the proximity sensor 1016 detects that this distance is gradually increasing, the processor 1001 controls the touch display screen 1005 to switch from the screen-off state back to the screen-on state.
Those skilled in the art will understand that the structure shown in Figure 10 does not limit the apparatus 1000, which may include more or fewer components than shown, combine certain components, or adopt a different arrangement of components.
Other embodiments of the invention will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. The present disclosure is intended to cover any variations, uses, or adaptations of the invention that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and examples are to be regarded as illustrative only, the true scope and spirit of the invention being indicated by the following claims.
It should be understood that the invention is not limited to the precise structure described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the invention is limited only by the appended claims.
Claims (10)
1. A speech recognition method, characterized by comprising:
obtaining an audio frame to be recognized;
extracting a Mel-scale filter bank feature and a speaker information vector of the audio frame, respectively;
performing fusion processing on the Mel-scale filter bank feature and the speaker information vector to obtain a fusion feature; and
processing the fusion feature based on a target acoustic model to obtain a speech recognition result of the audio frame, wherein the target acoustic model comprises a plurality of dilated convolutional layers.
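By way of illustration only and not as part of the claims, the following minimal sketch shows how the two features named in claim 1 might be obtained, computing a log-Mel filter bank feature with librosa and standing in a placeholder for the speaker information vector (for example, an i-vector from a separately trained extractor); the file name, sample rate, frame settings, and dimensions are all assumptions.

```python
import numpy as np
import librosa

# Load one utterance and compute a 40-band log-Mel filter bank feature
# per frame (25 ms window, 10 ms hop at 16 kHz are assumed settings).
y, sr = librosa.load("utterance.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                     hop_length=160, n_mels=40)
log_fbank = np.log(mel + 1e-6).T   # (frames, 40), one row per audio frame

# The speaker information vector is assumed to come from a separately
# trained extractor (e.g., an i-vector system); a 100-dim placeholder.
speaker_vec = np.zeros(100, dtype=np.float32)
```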
2. The speech recognition method according to claim 1, wherein the performing fusion processing on the Mel-scale filter bank feature and the speaker information vector comprises:
performing standardization processing on the Mel-scale filter bank feature to obtain a first intermediate feature;
performing dimension conversion processing on the speaker information vector to obtain a second intermediate feature, wherein a dimension of the second intermediate feature is greater than a dimension of the speaker information vector;
performing standardization processing on the second intermediate feature to obtain a third intermediate feature; and
performing fusion processing on the first intermediate feature and the third intermediate feature to obtain the fusion feature.
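A minimal sketch of the dimension conversion step in claim 2, assuming it is realized as a learned linear projection in PyTorch; the 100-dim input and 512-dim output are illustrative values, chosen only so that the second intermediate feature has a higher dimension than the speaker information vector.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: project a 100-dim speaker information vector
# up to 512 dims, so the output dimension exceeds the input dimension.
dim_convert = nn.Linear(100, 512)

speaker_vec = torch.randn(1, 100)               # speaker information vector
second_intermediate = dim_convert(speaker_vec)  # (1, 512) second intermediate feature
```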
3. The speech recognition method according to claim 2, wherein the performing standardization processing on the Mel-scale filter bank feature to obtain a first intermediate feature comprises:
standardizing the Mel-scale filter bank feature to a mean of 0 and a variance of 1 based on a first BatchNorm layer, to obtain the first intermediate feature;
and the performing standardization processing on the second intermediate feature to obtain a third intermediate feature comprises:
standardizing the second intermediate feature to a mean of 0 and a variance of 1 based on a second BatchNorm layer, to obtain the third intermediate feature.
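A minimal sketch of the two standardizations in claim 3, assuming PyTorch's nn.BatchNorm1d as the BatchNorm layer; at initialization its affine parameters (weight 1, bias 0) leave the output at approximately zero mean and unit variance over the batch. The feature dimensions are illustrative.

```python
import torch
import torch.nn as nn

# One BatchNorm layer per feature stream; 40 and 512 are assumed dims.
bn_fbank = nn.BatchNorm1d(40)     # first BatchNorm layer
bn_speaker = nn.BatchNorm1d(512)  # second BatchNorm layer

fbank_batch = torch.randn(32, 40)     # batch of Mel filter bank frames
speaker_batch = torch.randn(32, 512)  # batch of dimension-converted vectors

first_intermediate = bn_fbank(fbank_batch)      # ~zero mean, unit variance
third_intermediate = bn_speaker(speaker_batch)  # ~zero mean, unit variance
```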
4. The speech recognition method according to claim 1, wherein the target acoustic model comprises a dilated convolutional neural network and a long short-term memory (LSTM) network, the dilated convolutional neural network comprises the plurality of dilated convolutional layers, and the LSTM network comprises a plurality of LSTM layers;
the processing the fusion feature based on a target acoustic model to obtain the speech recognition result of the audio frame comprises:
inputting the fusion feature into the dilated convolutional neural network, and processing the fusion feature successively through the plurality of dilated convolutional layers, wherein the output of the previous dilated convolutional layer serves as the input of the next dilated convolutional layer;
taking a first output result of the last dilated convolutional layer as the input of the LSTM network, and processing the first output result successively through the plurality of LSTM layers, wherein the output of the previous LSTM layer serves as the input of the next LSTM layer; and
determining the speech recognition result based on a second output result of the last LSTM layer.
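A minimal sketch of an acoustic model with the structure described in claim 4, stacked dilated 1-D convolutions feeding stacked LSTM layers, written in PyTorch; the layer counts, dimensions, and output size are assumptions for illustration, not the patented configuration.

```python
import torch
import torch.nn as nn

class DilatedConvLSTM(nn.Module):
    """Sketch: dilated convolutional layers, then LSTM layers, where each
    layer's output feeds the next, and the last LSTM layer's output is
    mapped to recognition scores."""

    def __init__(self, in_dim=140, hidden=512, n_out=4000):
        super().__init__()
        # Three dilated convolutional layers with growing dilation, so the
        # receptive field widens without extra parameters per layer.
        self.convs = nn.ModuleList([
            nn.Conv1d(in_dim if i == 0 else hidden, hidden,
                      kernel_size=3, dilation=2 ** i, padding=2 ** i)
            for i in range(3)
        ])
        # Two stacked LSTM layers consume the last convolution's output.
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_out)  # e.g., senone posteriors

    def forward(self, x):                    # x: (batch, frames, in_dim)
        h = x.transpose(1, 2)                # Conv1d expects (batch, dim, frames)
        for conv in self.convs:              # each layer feeds the next
            h = torch.relu(conv(h))
        h, _ = self.lstm(h.transpose(1, 2))  # first output result -> LSTM stack
        return self.out(h)                   # from the last LSTM layer's output

model = DilatedConvLSTM()
scores = model(torch.randn(2, 50, 140))      # (batch 2, 50 frames) -> (2, 50, 4000)
```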
5. The speech recognition method according to claim 2, wherein the performing fusion processing on the first intermediate feature and the third intermediate feature to obtain the fusion feature comprises:
performing column exchange processing on the first intermediate feature and the third intermediate feature to obtain the fusion feature; or
performing weighted conversion processing on the first intermediate feature and the third intermediate feature based on a weight matrix to obtain the fusion feature.
6. A speech recognition apparatus, characterized by comprising:
an acquiring unit, configured to obtain an audio frame to be recognized;
an extraction unit, configured to extract a Mel-scale filter bank feature and a speaker information vector of the audio frame, respectively;
a fusion unit, configured to perform fusion processing on the Mel-scale filter bank feature and the speaker information vector to obtain a fusion feature; and
a processing unit, configured to process the fusion feature based on a target acoustic model to obtain a speech recognition result of the audio frame, wherein the target acoustic model comprises a plurality of dilated convolutional layers.
7. The speech recognition apparatus according to claim 6, wherein the fusion unit comprises:
a first processing subunit, configured to perform standardization processing on the Mel-scale filter bank feature to obtain a first intermediate feature;
a second processing subunit, configured to perform dimension conversion processing on the speaker information vector to obtain a second intermediate feature, wherein a dimension of the second intermediate feature is greater than a dimension of the speaker information vector;
a third processing subunit, configured to perform standardization processing on the second intermediate feature to obtain a third intermediate feature; and
a fusion subunit, configured to perform fusion processing on the first intermediate feature and the third intermediate feature to obtain the fusion feature.
8. The speech recognition apparatus according to claim 6, wherein the target acoustic model comprises a dilated convolutional neural network and a long short-term memory (LSTM) network, the dilated convolutional neural network comprises the plurality of dilated convolutional layers, and the LSTM network comprises a plurality of LSTM layers;
the processing unit is further configured to: input the fusion feature into the dilated convolutional neural network and process the fusion feature successively through the plurality of dilated convolutional layers, wherein the output of the previous dilated convolutional layer serves as the input of the next dilated convolutional layer; take a first output result of the last dilated convolutional layer as the input of the LSTM network and process the first output result successively through the plurality of LSTM layers, wherein the output of the previous LSTM layer serves as the input of the next LSTM layer; and determine the speech recognition result based on a second output result of the last LSTM layer.
9. A speech recognition apparatus, characterized by comprising:
a processor; and
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the speech recognition method according to any one of claims 1 to 5.
10. A non-transitory computer-readable storage medium, wherein, when instructions in the storage medium are executed by a processor of a speech recognition apparatus, the speech recognition apparatus is enabled to perform the speech recognition method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910418620.XA CN110047468B (en) | 2019-05-20 | 2019-05-20 | Speech recognition method, apparatus and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110047468A true CN110047468A (en) | 2019-07-23 |
CN110047468B CN110047468B (en) | 2022-01-25 |
Family
ID=67282705
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910418620.XA Active CN110047468B (en) | 2019-05-20 | 2019-05-20 | Speech recognition method, apparatus and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110047468B (en) |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU5205101A (en) * | 2000-05-04 | 2001-11-12 | Faculte Polytechnique De Mons | Robust parameters for noisy speech recognition |
CN101599271A (en) * | 2009-07-07 | 2009-12-09 | 华中科技大学 | Digital music emotion recognition method |
US10026397B2 (en) * | 2013-12-10 | 2018-07-17 | Google Llc | Processing acoustic sequences using long short-term memory (LSTM) neural networks that include recurrent projection layers |
CN104732978A (en) * | 2015-03-12 | 2015-06-24 | 上海交通大学 | Text-dependent speaker recognition method based on joint deep learning |
CN107785015A (en) * | 2016-08-26 | 2018-03-09 | 阿里巴巴集团控股有限公司 | Speech recognition method and apparatus |
CN107331384A (en) * | 2017-06-12 | 2017-11-07 | 平安科技(深圳)有限公司 | Speech recognition method and apparatus, computer device and storage medium |
CN108417201A (en) * | 2018-01-19 | 2018-08-17 | 苏州思必驰信息科技有限公司 | Single-channel multi-speaker identity recognition method and system |
CN108492820A (en) * | 2018-03-20 | 2018-09-04 | 华南理工大学 | Chinese speech recognition method based on a recurrent neural network language model and a deep neural network acoustic model |
CN109377984A (en) * | 2018-11-22 | 2019-02-22 | 北京中科智加科技有限公司 | ArcFace-based speech recognition method and apparatus |
CN109635936A (en) * | 2018-12-29 | 2019-04-16 | 杭州国芯科技股份有限公司 | Retraining-based neural network pruning and quantization method |
Non-Patent Citations (4)
Title |
---|
Ke Tan et al., "Gated Residual Networks with Dilated Convolutions for Supervised Speech Separation", Speech and Signal Processing * |
Tara N. Sainath et al., "Convolutional, Long Short-Term Memory, Fully Connected Deep Networks", Speech and Signal Processing * |
Sun Nian (孙念), "Short-utterance speaker recognition algorithm based on multi-feature i-vector", Computer Applications (计算机应用) * |
Zhang Yuqing (张玉清) et al., "Current status, trends and prospects of deep learning applied to network security", Journal of Computer Research and Development (计算机研究与发展) * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110534101A (en) * | 2019-08-27 | 2019-12-03 | 华中师范大学 | Mobile device source identification method and system based on multimodal fusion depth features |
CN110534101B (en) * | 2019-08-27 | 2022-02-22 | 华中师范大学 | Mobile device source identification method and system based on multimodal fusion depth features |
CN111739517A (en) * | 2020-07-01 | 2020-10-02 | 腾讯科技(深圳)有限公司 | Speech recognition method, speech recognition device, computer equipment and medium |
CN111739517B (en) * | 2020-07-01 | 2024-01-30 | 腾讯科技(深圳)有限公司 | Speech recognition method, device, computer equipment and medium |
CN111816166A (en) * | 2020-07-17 | 2020-10-23 | 字节跳动有限公司 | Voice recognition method, apparatus, and computer-readable storage medium storing instructions |
CN112133288A (en) * | 2020-09-22 | 2020-12-25 | 中用科技有限公司 | Speech-to-text processing method, system and device |
CN114446291A (en) * | 2020-11-04 | 2022-05-06 | 阿里巴巴集团控股有限公司 | Voice recognition method and device, intelligent sound box, household appliance, electronic equipment and medium |
CN113327616A (en) * | 2021-06-02 | 2021-08-31 | 广东电网有限责任公司 | Voiceprint recognition method and device, electronic equipment and storage medium |
CN115426582A (en) * | 2022-11-06 | 2022-12-02 | 江苏米笛声学科技有限公司 | Earphone audio processing method and device |
Also Published As
Publication number | Publication date |
---|---|
CN110047468B (en) | 2022-01-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110047468A (en) | Audio recognition method, device and storage medium | |
CN108615526B (en) | Method, device, terminal and storage medium for detecting keywords in voice signal | |
US11482208B2 (en) | Method, device and storage medium for speech recognition | |
CN110544488B (en) | Method and device for separating multi-person voice | |
CN107481718B (en) | Speech recognition method and apparatus, storage medium and electronic device | |
WO2019105285A1 (en) | Facial attribute recognition method, electronic device, and storage medium | |
CN111063342B (en) | Speech recognition method, speech recognition device, computer equipment and storage medium | |
CN110059661A (en) | Motion recognition method, human-computer interaction method, apparatus and storage medium | |
WO2021135628A1 (en) | Voice signal processing method and speech separation method | |
CN111696570B (en) | Voice signal processing method, device, equipment and storage medium | |
CN110379430A (en) | Voice-based animation display method and apparatus, computer device and storage medium | |
CN106030440A (en) | Smart circular audio buffer | |
US11482237B2 (en) | Method and terminal for reconstructing speech signal, and computer storage medium | |
CN110322760B (en) | Voice data generation method, device, terminal and storage medium | |
CN110059652A (en) | Face image processing method, apparatus and storage medium | |
CN110018970A (en) | Cache prefetching method, apparatus, equipment and computer readable storage medium | |
CN111696532A (en) | Speech recognition method, speech recognition device, electronic device and storage medium | |
WO2021052306A1 (en) | Voiceprint feature registration | |
CN111581958A (en) | Conversation state determining method and device, computer equipment and storage medium | |
CN111863020A (en) | Voice signal processing method, device, equipment and storage medium | |
CN112289325A (en) | Voiceprint recognition method and device | |
CN110390953A (en) | Howling detection method, apparatus, terminal and storage medium for voice signals | |
CN109961802A (en) | Sound quality comparison method, apparatus, electronic device and storage medium | |
CN112116908B (en) | Wake-up audio determining method, device, equipment and storage medium | |
CN110728993A (en) | Voice change recognition method and electronic device | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||