CN118230722B - Intelligent voice recognition method and system based on AI - Google Patents
Intelligent voice recognition method and system based on AI
- Publication number
- CN118230722B (application CN202410634545.1A)
- Authority
- CN
- China
- Prior art keywords
- waveform
- semantic
- sequence
- voice signal
- feature vectors
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 48
- 238000013527 convolutional neural network Methods 0.000 claims abstract description 25
- 238000000605 extraction Methods 0.000 claims abstract description 15
- 230000003044 adaptive effect Effects 0.000 claims abstract description 13
- 239000013598 vector Substances 0.000 claims description 206
- 230000003993 interaction Effects 0.000 claims description 39
- 239000012634 fragment Substances 0.000 claims description 37
- 230000004927 fusion Effects 0.000 claims description 28
- 239000011159 matrix material Substances 0.000 claims description 20
- 230000006870 function Effects 0.000 claims description 11
- 230000011218 segmentation Effects 0.000 claims description 7
- 230000009466 transformation Effects 0.000 claims description 7
- 230000003213 activating effect Effects 0.000 claims description 6
- 238000003062 neural network model Methods 0.000 claims description 6
- 230000004913 activation Effects 0.000 claims description 5
- 238000013507 mapping Methods 0.000 claims description 3
- 238000012545 processing Methods 0.000 abstract description 13
- 230000008569 process Effects 0.000 abstract description 11
- 230000007246 mechanism Effects 0.000 abstract description 7
- 239000000284 extract Substances 0.000 abstract description 6
- 238000013135 deep learning Methods 0.000 abstract description 4
- 238000011176 pooling Methods 0.000 description 6
- 238000010586 diagram Methods 0.000 description 4
- 230000000694 effects Effects 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 3
- 230000003595 spectral effect Effects 0.000 description 3
- 238000012549 training Methods 0.000 description 3
- 238000004364 calculation method Methods 0.000 description 2
- 238000006243 chemical reaction Methods 0.000 description 2
- 230000007613 environmental effect Effects 0.000 description 2
- 210000002569 neuron Anatomy 0.000 description 2
- 238000013179 statistical model Methods 0.000 description 2
- 238000013473 artificial intelligence Methods 0.000 description 1
- 238000013528 artificial neural network Methods 0.000 description 1
- 230000008901 benefit Effects 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 239000000306 component Substances 0.000 description 1
- 239000008358 core component Substances 0.000 description 1
- 238000013136 deep learning model Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 230000008451 emotion Effects 0.000 description 1
- 239000013604 expression vector Substances 0.000 description 1
- 230000004907 flux Effects 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 230000010354 integration Effects 0.000 description 1
- 230000002452 interceptive effect Effects 0.000 description 1
- 238000005065 mining Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000010606 normalization Methods 0.000 description 1
- 230000001737 promoting effect Effects 0.000 description 1
- 230000001902 propagating effect Effects 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 238000001228 spectrum Methods 0.000 description 1
- 238000000547 structure data Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Machine Translation (AREA)
Abstract
The application discloses an AI-based intelligent voice recognition method and system, which relate to the field of intelligent voice recognition. The method and system process and analyze the voice signal to be recognized by utilizing the strong feature extraction capability of deep learning network models such as the Convolutional Neural Network (CNN) and the graph convolutional neural network (GCN) and their capability to process structured data, extract local detail waveform features and global context semantic association features of the voice signal, and introduce an adaptive attention mechanism to integrate context information and highlight important feature distributions, thereby ensuring the accuracy of semantic recognition.
Description
Technical Field
The application relates to the field of intelligent voice recognition, and more particularly, to an AI-based intelligent voice recognition method and system.
Background
With the rapid development of artificial intelligence technology, intelligent voice recognition technology has become an important research direction in the field of human-computer interaction.
Traditional speech recognition systems rely on manual feature extraction and statistical models, and the methods often have limitations in processing complex speech signals, such as insufficient robustness to environmental noise, and poor adaptability to individual differences of speakers.
Therefore, an optimized intelligent speech recognition method and system are desired.
Disclosure of Invention
The present application has been made to solve the above-mentioned technical problems. The embodiment of the application provides an AI-based intelligent voice recognition method and system, which process and analyze the voice signal to be recognized by utilizing the strong feature extraction capability of deep learning network models such as Convolutional Neural Networks (CNNs) and graph convolutional neural networks (GCNs) and their capability to process structured data, extract local detail waveform features and global context semantic association features of the voice signal, and introduce an adaptive attention mechanism to integrate context information and highlight important feature distributions, thereby ensuring the accuracy of semantic recognition.
According to one aspect of the present application, there is provided an AI-based intelligent speech recognition method, including: acquiring a voice signal to be recognized; performing signal segmentation on the voice signal to be recognized to obtain a sequence of voice signal fragments; extracting global waveform semantic features of the sequence of the voice signal fragments to obtain a sequence of context voice signal fragment waveform semantic feature vectors; passing the sequence of the context voice signal segment waveform semantic feature vectors through a self-adaptive attention weight fusion network to obtain a global salient voice signal waveform feature vector; and determining a speech recognition result based on the globally significant speech signal waveform feature vector.
According to another aspect of the present application, there is provided an AI-based intelligent speech recognition system, including: a signal acquisition module, used for acquiring a voice signal to be recognized; a signal segmentation module, used for carrying out signal segmentation on the voice signal to be recognized so as to obtain a sequence of voice signal fragments; a global waveform semantic feature extraction module, used for extracting global waveform semantic features of the sequence of the voice signal fragments to obtain a sequence of context voice signal fragment waveform semantic feature vectors; a self-adaptive attention weight fusion module, used for enabling the sequence of the context voice signal segment waveform semantic feature vectors to pass through a self-adaptive attention weight fusion network to obtain a globally salient voice signal waveform feature vector; and a recognition result determining module, used for determining a voice recognition result based on the globally salient voice signal waveform feature vector.
Compared with the prior art, the AI-based intelligent voice recognition method and system provided by the application process and analyze the voice signal to be recognized by utilizing the strong feature extraction capability of deep learning network models such as the Convolutional Neural Network (CNN) and the graph convolutional neural network (GCN) and their capability to process structured data, extract local detail waveform features and global context semantic association features of the voice signal, and introduce an adaptive attention mechanism to integrate context information and highlight important feature distributions, thereby ensuring the accuracy of semantic recognition.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing embodiments of the present application in more detail with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification; they illustrate the application together with its embodiments and do not constitute a limitation of the application. In the drawings, like reference numerals generally refer to like parts or steps.
FIG. 1 is a flow chart of an AI-based intelligent speech recognition method in accordance with an embodiment of the application;
FIG. 2 is a method diagram of an AI-based intelligent speech recognition method in accordance with an embodiment of the application;
FIG. 3 is a flowchart of sub-step S3 of an AI-based intelligent speech recognition method in accordance with an embodiment of the application;
FIG. 4 is a flowchart of sub-step S4 of the AI-based intelligent speech recognition method in accordance with an embodiment of the application;
fig. 5 is a block diagram of an AI-based intelligent speech recognition system, in accordance with an embodiment of the present application.
Detailed Description
Hereinafter, exemplary embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.
As used in the specification and in the claims, the terms "a," "an," and "the" do not denote the singular but may include the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; they do not constitute an exclusive list, and a method or apparatus may also include other steps or elements.
Although the present application makes various references to certain modules in a system according to embodiments of the present application, any number of different modules may be used and run on a user terminal and/or server. The modules are merely illustrative, and different aspects of the systems and methods may use different modules.
A flowchart is used in the present application to describe the operations performed by a system according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed in order precisely. Rather, the various steps may be processed in reverse order or simultaneously, as desired. Also, other operations may be added to or removed from these processes.
In the field of intelligent speech recognition, it is a core task to accurately recognize speech signals and convert them into text information. However, the diversity and complexity of speech signals makes this task challenging. The speech signal is not only affected by factors such as speaker, speech speed, emotion, etc., but may also be disturbed by background noise. Traditional speech recognition systems rely on manual feature extraction and statistical models, and the methods often have limitations in processing complex speech signals, such as insufficient robustness to environmental noise, and poor adaptability to individual differences of speakers. Therefore, an optimized intelligent speech recognition method and system are desired.
In the technical scheme of the application, an intelligent voice recognition method based on AI is provided. Fig. 1 is a flowchart of an AI-based intelligent speech recognition method according to an embodiment of the present application. Fig. 2 is a method diagram of an AI-based intelligent speech recognition method according to an embodiment of the present application. As shown in fig. 1 and 2, the AI-based intelligent voice recognition method according to an embodiment of the present application includes the steps of: s1, acquiring a voice signal to be recognized; s2, carrying out signal segmentation on the voice signal to be recognized to obtain a sequence of voice signal fragments; s3, extracting global waveform semantic features of the sequence of the voice signal fragments to obtain a sequence of context voice signal fragment waveform semantic feature vectors; s4, passing the sequence of the context voice signal segment waveform semantic feature vectors through a self-adaptive attention weight fusion network to obtain global salient voice signal waveform feature vectors; and S5, determining a voice recognition result based on the globally significant voice signal waveform characteristic vector.
In particular, in S1, a speech signal to be recognized is acquired. The speech signal to be recognized serves as an important information source: it carries the language information of the speaker, including vocabulary, grammar, intonation and the like, and is the key and the basis for realizing speech-to-text conversion. In the embodiment of the application, voice data of a user can be acquired in real time through a microphone built into devices such as a smartphone, a notebook computer, or smart home equipment. In particular, in certain situations, such as recording podcasts or interviews, professional audio recording equipment may be used to obtain a high-quality speech signal.
In particular, the step S2 is to perform signal slicing on the speech signal to be recognized to obtain a sequence of speech signal segments. It is contemplated that in the practical application scenario of the present application the speech signal to be recognized is typically composed of a series of rapidly changing acoustic features. By slicing the speech signal to be recognized into shorter segments, the local characteristics of each segment, such as frequency, amplitude, etc., can be analyzed more efficiently. Meanwhile, the long signal is divided into a plurality of short segments, so that the data volume of single processing can be reduced, and the computational complexity and the resource consumption are reduced. More importantly, in noisy environments, the impact of noise on the overall speech recognition process can be reduced by signal slicing, mainly because ambient noise may affect only some segments, but not all.
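As an illustrative sketch (not a prescription from the application), the following Python snippet shows one way the signal slicing of step S2 could be realized with fixed-length overlapping segments; the 25 ms frame length and 10 ms hop are assumed values chosen only for the example.

```python
import numpy as np

def slice_speech_signal(signal: np.ndarray, sample_rate: int,
                        frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Cut a 1-D speech waveform into overlapping fixed-length segments.

    frame_ms / hop_ms are illustrative assumptions; the application does not
    fix a particular segment length.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    segments = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        segments.append(signal[start:start + frame_len])
    return np.stack(segments) if segments else np.empty((0, frame_len))

# usage: 1 s of 16 kHz audio -> sequence of 25 ms segments with a 10 ms hop
speech = np.random.randn(16000)
segment_sequence = slice_speech_signal(speech, 16000)
print(segment_sequence.shape)  # (num_segments, 400)
```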
Specifically, the step S3 is to perform global waveform semantic feature extraction on the sequence of the speech signal segments to obtain a sequence of waveform semantic feature vectors of the contextual speech signal segments. In particular, in one specific example of the present application, as shown in fig. 3, the S3 includes: s31, respectively passing each voice signal segment in the sequence of voice signal segments through a signal waveform feature extractor based on a convolutional neural network model to obtain a sequence of voice signal segment waveform semantic feature vectors; s32, calculating the hash similarity between any two voice signal segment waveform semantic feature vectors in the sequence of the voice signal segment waveform semantic feature vectors to obtain a segment waveform consistency topology matrix; s33, passing the sequence of the voice signal segment waveform semantic feature vectors and the segment waveform consistency topological matrix through a global context semantic encoder based on a graph convolution neural network model to obtain the sequence of the context voice signal segment waveform semantic feature vectors.
Specifically, in S31, each speech signal segment in the sequence of speech signal segments is passed through a signal waveform feature extractor based on a convolutional neural network model to obtain a sequence of speech signal segment waveform semantic feature vectors. The Convolutional Neural Network (CNN) model convolves each speech signal segment with learnable kernels so that its local receptive fields capture local patterns in the speech signal, such as the time-series fluctuation patterns of phonemes and tones. It is worth mentioning that a Convolutional Neural Network (CNN) is a deep learning model specially suited to data with a grid structure, such as images and speech. The core idea of a CNN is to extract features from the input data by convolution operations and to build higher-level representations and abstractions of those features by stacking layers. The basic components and working principles of a CNN are as follows: Convolutional layer: the convolutional layer is the core component of the CNN and is used to extract features from the input data. It performs convolution operations on the input by applying a set of learnable convolution kernels (filters); these operations capture local patterns and features in the input data and generate a series of feature maps. Activation function: after a convolutional layer, a nonlinear activation function such as ReLU is typically applied; it introduces nonlinearity so that the network can learn more complex patterns and representations. Pooling layer: the pooling layer reduces the size of the feature maps and the number of parameters while retaining the most important features; common pooling operations include max pooling and average pooling. Fully connected layer: after a series of convolution and pooling layers, several fully connected layers are typically added to convert the feature maps of the previous layer into the output result, such as a classification or regression. Dropout: to prevent overfitting, the Dropout technique is often used in CNNs; it randomly discards a portion of the neurons during training, reducing dependencies among neurons and improving the generalization capability of the model. Through the back-propagation algorithm, the CNN can automatically learn and extract the features in the input data and optimize them according to the training target. During training, the CNN adjusts the network parameters by minimizing the loss function so that the output results are as close as possible to the true labels.
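The following is a minimal, hedged PyTorch sketch of a CNN-based signal waveform feature extractor of the kind described above; the layer sizes, kernel widths, and the 128-dimensional output are illustrative assumptions rather than parameters specified by the application.

```python
import torch
import torch.nn as nn

class SignalWaveformFeatureExtractor(nn.Module):
    """Minimal 1-D CNN mapping a raw speech segment to a waveform semantic
    feature vector (all layer sizes are illustrative assumptions)."""
    def __init__(self, feature_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, padding=4),   # capture local waveform patterns
            nn.ReLU(),
            nn.MaxPool1d(4),                               # pooling keeps the most salient responses
            nn.Conv1d(16, 32, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                       # collapse the time axis
        )
        self.dropout = nn.Dropout(0.1)                     # regularization against overfitting
        self.fc = nn.Linear(32, feature_dim)               # fully connected projection

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (batch, segment_length) raw waveform samples
        x = self.conv(segment.unsqueeze(1)).squeeze(-1)    # (batch, 32)
        return self.fc(self.dropout(x))                    # (batch, feature_dim)

# usage: a batch of 400-sample segments -> 128-d waveform semantic feature vectors
feats = SignalWaveformFeatureExtractor()(torch.randn(8, 400))
print(feats.shape)  # torch.Size([8, 128])
```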
Specifically, in S32, the hash similarity between any two speech signal segment waveform semantic feature vectors in the sequence of speech signal segment waveform semantic feature vectors is calculated to obtain a segment waveform consistency topology matrix. Hash similarity can be used to measure the similarity between two data objects. In the processing of the voice signal, by calculating the hash similarity between the waveform semantic feature vectors of two voice signal segments, the distance or similarity between the hash codes corresponding to those feature vectors can be compared and measured, so that the segment waveform consistency topology matrix is quickly constructed, which facilitates analysis of the overall structural pattern and the inherent implicit associations of the voice signal. In particular, since hash similarity focuses more on the overall pattern than on specific details, its calculation is robust to noise to a certain degree, which makes it easier to express and describe the overall association pattern among different voice segments. In a specific example of the present application, the encoding process for calculating the hash similarity between any two speech signal segment waveform semantic feature vectors in the sequence of speech signal segment waveform semantic feature vectors to obtain the segment waveform consistency topology matrix includes: firstly, calculating the voice signal segment waveform semantic hash coding feature vector of each voice signal segment waveform semantic feature vector in the sequence of voice signal segment waveform semantic feature vectors by using a hash mapping function, so as to obtain a sequence of voice signal segment waveform semantic hash coding feature vectors; then, calculating the cosine similarity between every two voice signal segment waveform semantic hash coding feature vectors in the sequence of voice signal segment waveform semantic hash coding feature vectors to obtain a sequence of hash similarities; and then, arranging the sequence of hash similarities two-dimensionally to obtain the segment waveform consistency topology matrix.
More specifically, calculating the cosine similarity between every two voice signal segment waveform semantic hash coding feature vectors in the sequence of voice signal segment waveform semantic hash coding feature vectors to obtain the sequence of hash similarities comprises: firstly, computing the position-wise element-by-element product of the former voice signal segment waveform semantic hash coding feature vector and the latter voice signal segment waveform semantic hash coding feature vector in each pair, and summing the products to obtain an element-by-element association projection fusion value between the signal segments; then, computing the Euclidean norms of the former and the latter voice signal segment waveform semantic hash coding feature vectors respectively, and multiplying the two Euclidean norms to obtain a semantic interaction fusion value between the signal segments; and finally, dividing the element-by-element association projection fusion value between the signal segments by the semantic interaction fusion value between the signal segments to obtain the hash similarity.
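A hedged sketch of step S32 follows: it assumes sign-of-random-projection hashing as the hash mapping function (the application does not fix a particular hash) and computes the pairwise cosine similarity of the hash codes exactly as described, arranging the results into the segment waveform consistency topology matrix.

```python
import numpy as np

def hash_encode(features: np.ndarray, code_dim: int = 64, seed: int = 0) -> np.ndarray:
    """Sign-of-random-projection hashing (an assumed choice of hash mapping
    function) applied to each segment feature vector."""
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((features.shape[1], code_dim))
    return np.sign(features @ projection)          # (num_segments, code_dim), entries in {-1, 0, +1}

def consistency_topology_matrix(features: np.ndarray) -> np.ndarray:
    """Cosine similarity between every pair of hash codes, arranged two-
    dimensionally as the segment waveform consistency topology matrix."""
    codes = hash_encode(features)
    dot = codes @ codes.T                          # position-wise products summed for each pair
    norms = np.linalg.norm(codes, axis=1, keepdims=True)
    return dot / (norms @ norms.T)                 # divide by the product of Euclidean norms

# usage: 8 segments with 128-d feature vectors -> 8 x 8 topology matrix
topology = consistency_topology_matrix(np.random.randn(8, 128))
print(topology.shape)  # (8, 8)
```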
Specifically, in S33, the sequence of speech signal segment waveform semantic feature vectors and the segment waveform consistency topology matrix are passed through a global context semantic encoder based on a graph convolutional neural network model to obtain the sequence of contextual speech signal segment waveform semantic feature vectors. Those of ordinary skill in the art will appreciate that the graph convolutional neural network (GCN) model is particularly adept at processing graph-structured data and can capture complex relationships between nodes. In speech recognition, this may be used to model global context relationships between speech signal segments, thereby strengthening the associations between them. More specifically, the graph convolutional neural network model achieves deep fusion of features by simultaneously processing the local features of the speech signal (the sequence of speech signal segment waveform semantic feature vectors) and the global structure information (the segment waveform consistency topology matrix), and learns long-distance dependency relationships existing in the speech signal; for example, the pronunciation of a word may be affected by a preceding or following word. This can enhance the semantic understanding capabilities of the model.
Accordingly, in one possible implementation, the sequence of speech signal segment waveform semantic feature vectors and the segment waveform consistency topology matrix may be passed through a global context semantic encoder based on a graph-convolution neural network model to obtain the sequence of contextual speech signal segment waveform semantic feature vectors, for example: inputting the sequence of the voice signal fragment waveform semantic feature vector and the fragment waveform consistency topology matrix; taking a sequence of waveform semantic feature vectors of the voice signal fragments as node features, wherein each node represents a feature vector of one voice signal fragment; defining connection relations among nodes by using a segment waveform consistency topology matrix, wherein the topology matrix represents similarity or association among the nodes; defining a graph convolution neural network model for learning a representation of node features on a graph structure; constructing a global context semantic encoder for learning global context information and encoding semantic features; in the global context semantic encoder, feature propagation and integration will take place over the entire graph structure, propagating and integrating node features through multiple rounds of GCN operations; after processing by the global context semantic encoder, each node (i.e., the feature vector of each speech signal segment) obtains an updated representation to obtain a sequence of waveform semantic feature vectors of the context speech signal segment.
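The snippet below sketches such a global context semantic encoder as a two-layer graph convolution with symmetric adjacency normalization; the depth, dimensions, and normalization choice are assumptions for illustration, not the exact encoder claimed by the application.

```python
import torch
import torch.nn as nn

class GlobalContextSemanticEncoder(nn.Module):
    """Two rounds of graph convolution over the segment feature sequence, using
    the consistency topology matrix as a weighted adjacency; a sketch only."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.w1 = nn.Linear(dim, dim)
        self.w2 = nn.Linear(dim, dim)

    @staticmethod
    def _normalize(adj: torch.Tensor) -> torch.Tensor:
        # symmetric normalization D^{-1/2} (A + I) D^{-1/2}
        a = adj + torch.eye(adj.size(0))
        d = a.sum(dim=1).clamp(min=1e-6).pow(-0.5)
        return d.unsqueeze(1) * a * d.unsqueeze(0)

    def forward(self, node_feats: torch.Tensor, topology: torch.Tensor) -> torch.Tensor:
        # node_feats: (num_segments, dim); topology: (num_segments, num_segments)
        a_hat = self._normalize(topology)
        h = torch.relu(self.w1(a_hat @ node_feats))   # first round of propagation and integration
        return self.w2(a_hat @ h)                     # contextual segment waveform semantic feature vectors

# usage
encoder = GlobalContextSemanticEncoder()
contextual = encoder(torch.randn(8, 128), torch.rand(8, 8))
print(contextual.shape)  # torch.Size([8, 128])
```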
It should be noted that, in other specific examples of the present application, the global waveform semantic feature extraction may be performed on the sequence of speech signal segments in other manners to obtain the sequence of contextual speech signal segment waveform semantic feature vectors, for example: inputting the sequence of speech signal segments; windowing each speech signal segment and converting the time-domain signal into a frequency-domain representation, typically using a short-time Fourier transform or another time-frequency conversion method; extracting global waveform semantic features from the frequency-domain representation, which may include spectral features, the spectral envelope, the spectral centroid, the spectral flux, and the like, all of which can capture various semantic information of the speech signal; considering the temporal correlation of speech signals, introducing context information on the basis of the global waveform semantic feature extraction; and combining the global waveform semantic feature extraction result of each speech signal segment into a feature vector to obtain the sequence of contextual speech signal segment waveform semantic feature vectors.
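As a hedged illustration of this alternative, the following snippet computes a few of the listed frequency-domain descriptors (spectral centroid, spectral flux, log energy) for a single segment using a short-time Fourier transform; the window length and the particular subset of features are assumptions made for the example.

```python
import numpy as np
from scipy.signal import stft

def spectral_segment_features(segment: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    """Frequency-domain summary of one segment: spectral centroid, spectral
    flux and log energy (an illustrative subset of the features listed above)."""
    freqs, _, spec = stft(segment, fs=sample_rate, nperseg=128)
    mag = np.abs(spec)                                   # (freq_bins, frames)
    centroid = (freqs[:, None] * mag).sum(0) / (mag.sum(0) + 1e-9)
    flux = np.sqrt((np.diff(mag, axis=1) ** 2).sum(0))  # frame-to-frame spectral change
    log_energy = np.log((mag ** 2).sum(0) + 1e-9)
    return np.array([centroid.mean(),
                     flux.mean() if flux.size else 0.0,
                     log_energy.mean()])

# usage: one 400-sample segment -> 3-d spectral feature vector
print(spectral_segment_features(np.random.randn(400)))
```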
In particular, the step S4 is to pass the sequence of the contextual speech signal segment waveform semantic feature vectors through an adaptive attention weight fusion network to obtain a globally salient speech signal waveform feature vector. The self-adaptive attention weight fusion network is essentially a weight probability distribution mechanism, namely, important contents are assigned with larger weights, and other contents are reduced in weight. Such a mechanism is more focused on finding useful information in the input data that is significantly related to the current data, mining the autocorrelation between waveform semantic feature vectors of individual contextual speech signal segments. The self-adaptive attention weight fusion network can learn semantic dependency and context correlation between voice signal segment data in different local time spans in sequence data of the context voice signal segment waveform semantic feature vector, and can more comprehensively understand dynamic features and semantic change modes in the voice signal segment data by carrying out autocorrelation modeling on the whole sequence. In particular, in one specific example of the present application, as shown in fig. 4, the S4 includes: s41, calculating waveform autocorrelation attention weights of waveform semantic feature vectors of all the contextual speech signal segments in the sequence of waveform semantic feature vectors of the contextual speech signal segments to obtain a sequence of waveform autocorrelation attention weights; s42, normalizing the sequence of waveform autocorrelation attention weights to obtain the sequence of waveform autocorrelation attention weight coefficients; and S43, calculating a vector-by-vector weighted sum of the sequence of the context speech signal segment waveform semantic feature vectors with the sequence of waveform autocorrelation attention weight coefficients as weights to obtain the globally salient speech signal waveform feature vectors.
Specifically, the step S41 calculates waveform autocorrelation attention weights of waveform semantic feature vectors of each of the contextual speech signal segment waveform semantic feature vectors in the sequence of contextual speech signal segment waveform semantic feature vectors to obtain a sequence of waveform autocorrelation attention weights. In a specific example, first, determining a weight matrix of each contextual speech signal segment waveform semantic feature vector in the sequence of contextual speech signal segment waveform semantic feature vectors to obtain a set of weight matrices; based on the set of weight matrices, respectively calculating matrix products between each group of corresponding weight matrices and the context voice signal segment waveform semantic feature vectors to obtain a sequence of weighted context voice signal segment waveform semantic feature vectors; then, adding each weighted context voice signal segment waveform semantic feature vector in the sequence of weighted context voice signal segment waveform semantic feature vectors with a bias vector to obtain a sequence of biased context voice signal segment waveform semantic feature vectors; activating the sequence of waveform semantic feature vectors of the biased context voice signal fragments to obtain the sequence of waveform semantic feature vectors of the nonlinear transformation context voice signal fragments; and finally, calculating the product between each nonlinear transformation context voice signal segment waveform semantic feature vector and the transpose vector of the preset reference feature vector in the sequence of nonlinear transformation context voice signal segment waveform semantic feature vectors to obtain the sequence of waveform autocorrelation attention weights. Activating the sequence of the waveform semantic feature vectors of the biased contextual speech signal segment to obtain the sequence of the waveform semantic feature vectors of the nonlinear transformation contextual speech signal segment, wherein the method comprises the following steps: activating the sequence of biased contextual speech signal segment waveform semantic feature vectors using Selu activation functions to obtain the sequence of nonlinear transformed contextual speech signal segment waveform semantic feature vectors.
Specifically, the step S42 normalizes the sequence of waveform autocorrelation attention weights to obtain a sequence of waveform autocorrelation attention weight coefficients. It should be appreciated that normalization may eliminate the dimensional effects between different data, ensuring that the values of the individual waveform autocorrelation attention weights are within a similar range.
Specifically, the step S43 calculates a vector-wise weighted sum of the sequence of the waveform semantic feature vectors of the contextual speech signal segment with the sequence of the waveform autocorrelation attention weight coefficients as weights to obtain the globally salient speech signal waveform feature vector. It should be appreciated that the vector-wise weighted sum operation may sum feature vectors of individual speech signal segments by importance weights to obtain globally pronounced speech signal waveform feature vectors. This allows the information in the whole sequence to be integrated into one vector, better characterizing the whole speech signal sequence.
In summary, in the above embodiment, passing the sequence of the contextual speech signal segment waveform semantic feature vectors through an adaptive attention weight fusion network to obtain a globally significant speech signal waveform feature vector includes: using an adaptive attention weight fusion network to enable the sequence of the context voice signal segment waveform semantic feature vectors to pass through the adaptive attention weight fusion network to obtain global salient voice signal waveform feature vectors according to the following formula; wherein, the formula is:
$$a_k = q^{\top}\,\mathrm{Selu}\!\left(W x_k + b\right)$$

$$\alpha_k = \frac{\exp\!\left(a_k\right)}{\sum_{j=1}^{N}\exp\!\left(a_j\right)}$$

$$v = \sum_{k=1}^{N}\alpha_k\, x_k$$

wherein $x_k$ is the $k$-th contextual speech signal segment waveform semantic feature vector in the sequence of contextual speech signal segment waveform semantic feature vectors, $q^{\top}$ and $W$ respectively represent the transpose vector of the preset reference feature vector and the weight matrix, $b$ is the bias vector, $\mathrm{Selu}(\cdot)$ is the activation function, $a_k$ is the $k$-th waveform autocorrelation attention weight, $\alpha_k$ is the corresponding waveform autocorrelation attention weight coefficient, $e$ is the natural constant and $\exp$ denotes the exponential operation, $N$ is the number of feature vectors in the sequence of contextual speech signal segment waveform semantic feature vectors, and $v$ is the globally salient speech signal waveform feature vector.
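A minimal PyTorch sketch of the three steps formalized above (scoring, normalization, weighted sum) is given below; the feature dimension and the random initialization of the learnable reference vector are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveAttentionWeightFusion(nn.Module):
    """Scores each contextual segment vector with W, b, Selu and a reference
    vector q, normalizes the scores, and forms the weighted sum."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.W = nn.Linear(dim, dim)              # weight matrix and bias vector
        self.q = nn.Parameter(torch.randn(dim))   # preset reference feature vector (assumed learnable)

    def forward(self, ctx_feats: torch.Tensor) -> torch.Tensor:
        # ctx_feats: (num_segments, dim) contextual segment waveform semantic feature vectors
        scores = torch.selu(self.W(ctx_feats)) @ self.q      # waveform autocorrelation attention weights a_k
        coeffs = torch.softmax(scores, dim=0)                # normalized coefficients alpha_k
        return (coeffs.unsqueeze(1) * ctx_feats).sum(dim=0)  # globally salient waveform feature vector v

# usage
fused = AdaptiveAttentionWeightFusion()(torch.randn(8, 128))
print(fused.shape)  # torch.Size([128])
```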
It should be noted that, in other specific examples of the present application, the sequence of the contextual speech signal segment waveform semantic feature vectors may also be passed through an adaptive attention weight fusion network to obtain a globally significant speech signal waveform feature vector in other manners, for example: inputting a sequence of waveform semantic feature vectors of the contextual speech signal segment; introducing an adaptive attention mechanism for learning the importance weight of each contextual speech signal segment waveform semantic feature vector; by calculating the attention weight of each vector, the network can automatically learn the attention degree of different voice signal fragments so as to better fuse information; according to the calculated attention weight, carrying out weighted fusion on the context voice signal segment waveform semantic feature vector so as to obtain a globally salient voice signal waveform feature vector; and carrying out weighted fusion on the waveform semantic feature vector of each context voice signal segment according to the attention weight so as to obtain the global salient voice signal waveform feature vector.
In particular, in S5, a speech recognition result is determined based on the globally salient speech signal waveform feature vector. In one specific example of the present application, the globally salient speech signal waveform feature vector is passed through a decoder-based speech recognizer to obtain the speech recognition result. Preferably, passing the globally salient speech signal waveform feature vector through a decoder-based speech recognizer to obtain the speech recognition result specifically includes: dividing each feature value of the globally salient speech signal waveform feature vector by the largest feature value of the globally salient speech signal waveform feature vector to obtain a globally salient speech signal waveform semantic interaction representation vector; dividing the mean of the feature values of the globally salient speech signal waveform feature vector by the standard deviation of those feature values to obtain a statistical dimension interaction value corresponding to the globally salient speech signal waveform feature vector; subtracting the statistical dimension interaction value from each feature value of the globally salient speech signal waveform semantic interaction representation vector, and taking the logarithm at each position, to obtain a globally salient speech signal waveform semantic interaction information representation vector; adding the statistical dimension interaction value to each feature value of the globally salient speech signal waveform semantic interaction representation vector and multiplying the sum by a preset weight hyperparameter to obtain a globally salient speech signal waveform semantic interaction pattern representation vector; combining the globally salient speech signal waveform semantic interaction information representation vector and the globally salient speech signal waveform semantic interaction pattern representation vector point-wise to obtain an optimized globally salient speech signal waveform feature vector; and passing the optimized globally salient speech signal waveform feature vector through the decoder-based speech recognizer to obtain the speech recognition result.
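The following sketch implements the statistical rescaling described above for optimizing the globally salient feature vector; the preset weight hyperparameter value, the point-wise product used as the final combination, and the clamp guarding the logarithm are assumptions made for this illustration (the decoder-based recognizer itself is not shown).

```python
import torch

def optimize_global_feature(v: torch.Tensor, weight: float = 0.5) -> torch.Tensor:
    """Statistical-interaction rescaling of the globally salient feature vector.

    `weight` is the preset weight hyperparameter (its value is an assumption);
    the clamp below is a numerical safeguard added for this sketch only.
    """
    interaction = v / v.max()                                  # semantic interaction representation
    stat = v.mean() / (v.std() + 1e-9)                         # statistical dimension interaction value
    info = torch.log((interaction - stat).clamp(min=1e-9))    # interaction information representation
    pattern = (interaction + stat) * weight                    # interaction pattern representation
    return info * pattern                                      # point-wise combination -> optimized vector

# usage: the optimized vector would then be fed to the decoder-based recognizer
optimized = optimize_global_feature(torch.rand(128) + 0.5)
print(optimized.shape)  # torch.Size([128])
```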
In order to improve the feature expression of the globally salient speech signal waveform feature vector in the complex feature-expression dimension, the above steps reconstruct its complex manifold structure, in the form of a global structural latent manifold dictionary, based on the feature information pattern and the feature distribution pattern of the semantic interaction pattern representation vector. This improves the model's ability, during iteration, to generate and understand the manifold structure corresponding to the feature under the complex feature-expression dimension, and thereby improves the speech recognition effect of the decoder-based speech recognizer that uses the globally salient speech signal waveform feature vector.
In summary, the AI-based intelligent speech recognition method according to an embodiment of the present application has been explained. It processes and analyzes the speech signal to be recognized by utilizing the strong feature extraction capability of deep learning network models such as the Convolutional Neural Network (CNN) and the graph convolutional neural network (GCN) and their capability to process structured data, extracts local detail waveform features and global context semantic association features of the speech signal, and introduces an adaptive attention mechanism to integrate context information and highlight important feature distributions, thereby ensuring the accuracy of semantic recognition.
Further, an AI-based intelligent speech recognition system is also provided.
Fig. 5 is a block diagram of an AI-based intelligent speech recognition system, in accordance with an embodiment of the present application. As shown in fig. 5, the AI-based intelligent speech recognition system 300, according to an embodiment of the present application, includes: a signal acquisition module 310, configured to acquire a voice signal to be recognized; the signal slicing module 320 is configured to perform signal slicing on the speech signal to be identified to obtain a sequence of speech signal segments; the global waveform semantic feature extraction module 330 is configured to perform global waveform semantic feature extraction on the sequence of speech signal segments to obtain a sequence of context speech signal segment waveform semantic feature vectors; the adaptive attention weight fusion module 340 is configured to pass the sequence of the context speech signal segment waveform semantic feature vectors through an adaptive attention weight fusion network to obtain a global salient speech signal waveform feature vector; and a recognition result determining module 350 for determining a speech recognition result based on the globally significant speech signal waveform feature vector.
As described above, the AI-based intelligent speech recognition system 300 according to an embodiment of the present application can be implemented in various wireless terminals, such as a server or the like having an AI-based intelligent speech recognition algorithm. In one possible implementation, the AI-based intelligent speech recognition system 300 in accordance with an embodiment of the application can be integrated into a wireless terminal as a software module and/or hardware module. For example, the AI-based intelligent speech recognition system 300 can be a software module in the operating system of the wireless terminal or can be an application developed for the wireless terminal; of course, the AI-based intelligent speech recognition system 300 could equally be one of a number of hardware modules of the wireless terminal.
In another example, the AI-based intelligent speech recognition system 300 can connect to the wireless terminal via a wired and/or wireless network and transmit the interactive information in an agreed-upon data format.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (8)
1. The intelligent voice recognition method based on the AI is characterized by comprising the following steps:
Acquiring a voice signal to be recognized;
performing signal segmentation on the voice signal to be recognized to obtain a sequence of voice signal fragments;
Extracting global waveform semantic features of the sequence of the voice signal fragments to obtain a sequence of context voice signal fragment waveform semantic feature vectors;
passing the sequence of the context voice signal segment waveform semantic feature vectors through a self-adaptive attention weight fusion network to obtain a global salient voice signal waveform feature vector; and
Determining a speech recognition result based on the global-saliency speech signal waveform feature vectors;
determining a speech recognition result based on the globally significant speech signal waveform feature vector, comprising:
passing the global-saliency speech signal waveform feature vectors through a decoder-based speech recognizer to obtain the speech recognition result;
the step of passing the global significant voice signal waveform feature vector through a decoder-based voice recognizer to obtain a voice recognition result specifically comprises the following steps: dividing each eigenvalue of the globally significant speech signal waveform eigenvector by the largest eigenvalue of the globally significant speech signal waveform eigenvector to obtain a globally significant speech signal waveform semantic interaction representation vector; dividing the mean value of the eigenvalues of the waveform eigenvectors of the global salified voice signal by the standard deviation of the eigenvalues of the waveform eigenvectors of the global salified voice signal to obtain a statistical dimension interaction value corresponding to the waveform eigenvector of the global salified voice signal; subtracting the statistical dimension interaction value from each characteristic value of the global saliency speech signal waveform semantic interaction representation vector, and calculating the logarithmic value of each position of the global saliency speech signal waveform semantic interaction representation vector to obtain a global saliency speech signal waveform semantic interaction information representation vector; adding each characteristic value of the global saliency speech signal waveform semantic interaction representation vector to the statistical dimension interaction value, and multiplying the sum by a preset weight super parameter to obtain a global saliency speech signal waveform semantic interaction mode representation vector; obtaining an optimized global saliency speech signal waveform feature vector by combining the global saliency speech signal waveform semantic interaction information representation vector and the global saliency speech signal waveform semantic interaction mode representation vector point; and passing the optimized globally significant speech signal waveform feature vector through a decoder-based speech recognizer to obtain a speech recognition result.
2. The AI-based intelligent speech recognition method of claim 1, wherein performing global waveform semantic feature extraction on the sequence of speech signal segments to obtain a sequence of contextual speech signal segment waveform semantic feature vectors comprises:
Each voice signal segment in the sequence of voice signal segments respectively passes through a signal waveform feature extractor based on a convolutional neural network model to obtain a sequence of voice signal segment waveform semantic feature vectors;
Calculating the hash similarity between any two voice signal segment waveform semantic feature vectors in the sequence of the voice signal segment waveform semantic feature vectors to obtain a segment waveform consistency topology matrix;
And passing the sequence of the voice signal segment waveform semantic feature vectors and the segment waveform consistency topology matrix through a global context semantic encoder based on a graph convolution neural network model to obtain the sequence of the context voice signal segment waveform semantic feature vectors.
3. The AI-based intelligent speech recognition method of claim 2, wherein calculating the hash similarity between any two speech signal segment waveform semantic feature vectors in the sequence of speech signal segment waveform semantic feature vectors to obtain a segment waveform consistency topology matrix comprises:
Calculating the voice signal segment waveform semantic hash coding feature vector of each voice signal segment waveform semantic feature vector in the sequence of the voice signal segment waveform semantic feature vector by using a hash mapping function so as to obtain the sequence of the voice signal segment waveform semantic hash coding feature vector;
calculating cosine similarity between every two voice signal segment waveform semantic hash coding feature vectors in the sequence of the voice signal segment waveform semantic hash coding feature vectors to obtain a sequence of hash similarity; and
And arranging the sequence of hash similarities two-dimensionally to obtain the segment waveform consistency topology matrix.
4. The AI-based intelligent speech recognition method of claim 3, wherein calculating cosine similarity between each two of the speech signal segment waveform semantic hash coding feature vectors in the sequence of speech signal segment waveform semantic hash coding feature vectors to obtain the sequence of hash similarities comprises:
Calculating the sum of the element-by-element point multiplication according to the position of the previous voice signal segment waveform semantic hash coding feature vector and the next voice signal segment waveform semantic hash coding feature vector in every two voice signal segment waveform semantic hash coding feature vectors to obtain an element-by-element association projection fusion value between signal segments;
Respectively calculating Euclidean norms of the waveform semantic hash coding feature vector of the previous voice signal segment and the waveform semantic hash coding feature vector of the next voice signal segment, and multiplying the obtained two Euclidean norms to obtain a semantic interaction fusion value between the signal segments; and
Dividing the element-by-element association projection fusion value among the signal fragments by the semantic interaction fusion value among the signal fragments to obtain the hash similarity.
5. The AI-based intelligent speech recognition method of claim 4, wherein passing the sequence of contextual speech signal segment waveform semantic feature vectors through an adaptive attention weight fusion network to obtain a globally salient speech signal waveform feature vector comprises:
Calculating waveform autocorrelation attention weights of waveform semantic feature vectors of all the contextual speech signal segments in the sequence of waveform semantic feature vectors of the contextual speech signal segments to obtain a sequence of waveform autocorrelation attention weights;
Normalizing the sequence of waveform autocorrelation attention weights to obtain a sequence of waveform autocorrelation attention weight coefficients; and
And calculating a vector-by-vector weighted sum of the sequence of contextual speech signal segment waveform semantic feature vectors with the sequence of waveform autocorrelation attention weight coefficients as weights to obtain the globally significant speech signal waveform feature vector.
6. The AI-based intelligent speech recognition method of claim 5, wherein calculating waveform autocorrelation attention weights for each of the sequence of contextual speech signal segment waveform semantic feature vectors to obtain a sequence of waveform autocorrelation attention weights comprises:
determining weight matrixes of the waveform semantic feature vectors of the contextual speech signal segments in the sequence of the waveform semantic feature vectors of the contextual speech signal segments to obtain a set of weight matrixes;
Based on the set of weight matrices, respectively calculating matrix products between each group of corresponding weight matrices and the context voice signal segment waveform semantic feature vectors to obtain a sequence of weighted context voice signal segment waveform semantic feature vectors;
adding each weighted context voice signal segment waveform semantic feature vector in the sequence of weighted context voice signal segment waveform semantic feature vectors with a bias vector to obtain a sequence of biased context voice signal segment waveform semantic feature vectors;
activating the sequence of waveform semantic feature vectors of the biased context voice signal fragments to obtain the sequence of waveform semantic feature vectors of the nonlinear transformation context voice signal fragments; and
And calculating the product between each nonlinear transformation context voice signal segment waveform semantic feature vector and the transpose vector of the preset reference feature vector in the sequence of nonlinear transformation context voice signal segment waveform semantic feature vectors to obtain the sequence of waveform autocorrelation attention weights.
7. The AI-based intelligent speech recognition method of claim 6, wherein activating the sequence of biased contextual speech signal segment waveform semantic feature vectors to obtain a sequence of nonlinear transformed contextual speech signal segment waveform semantic feature vectors comprises:
Activating the sequence of biased contextual speech signal segment waveform semantic feature vectors using Selu activation functions to obtain the sequence of nonlinear transformed contextual speech signal segment waveform semantic feature vectors.
8. An AI-based intelligent speech recognition system, comprising:
the signal acquisition module is used for acquiring a voice signal to be recognized;
the signal segmentation module is used for carrying out signal segmentation on the voice signal to be recognized so as to obtain a sequence of voice signal fragments;
the global waveform semantic feature extraction module is used for extracting global waveform semantic features of the sequence of the voice signal fragments to obtain a sequence of context voice signal fragment waveform semantic feature vectors;
the self-adaptive attention weight fusion module is used for enabling the sequence of the context voice signal segment waveform semantic feature vectors to pass through a self-adaptive attention weight fusion network to obtain global salient voice signal waveform feature vectors;
the system further includes a recognition result determination module for determining a speech recognition result based on the globally significant speech signal waveform feature vector;
The identification result determining module is specifically configured to:
passing the global-saliency speech signal waveform feature vectors through a decoder-based speech recognizer to obtain the speech recognition result;
The identification result determining module is specifically configured to: divide each feature value of the globally salient speech signal waveform feature vector by the largest feature value of the globally salient speech signal waveform feature vector to obtain a globally salient speech signal waveform semantic interaction representation vector; divide the mean of the feature values of the globally salient speech signal waveform feature vector by the standard deviation of those feature values to obtain a statistical dimension interaction value corresponding to the globally salient speech signal waveform feature vector; subtract the statistical dimension interaction value from each feature value of the globally salient speech signal waveform semantic interaction representation vector, and take the logarithm at each position, to obtain a globally salient speech signal waveform semantic interaction information representation vector; add the statistical dimension interaction value to each feature value of the globally salient speech signal waveform semantic interaction representation vector and multiply the sum by a preset weight hyperparameter to obtain a globally salient speech signal waveform semantic interaction pattern representation vector; combine the globally salient speech signal waveform semantic interaction information representation vector and the globally salient speech signal waveform semantic interaction pattern representation vector point-wise to obtain an optimized globally salient speech signal waveform feature vector; and pass the optimized globally salient speech signal waveform feature vector through the decoder-based speech recognizer to obtain the speech recognition result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410634545.1A CN118230722B (en) | 2024-05-22 | 2024-05-22 | Intelligent voice recognition method and system based on AI |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410634545.1A CN118230722B (en) | 2024-05-22 | 2024-05-22 | Intelligent voice recognition method and system based on AI |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118230722A (en) | 2024-06-21
CN118230722B (en) | 2024-08-13
Family
ID=91501214
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410634545.1A Active CN118230722B (en) | 2024-05-22 | 2024-05-22 | Intelligent voice recognition method and system based on AI |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118230722B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118646823B (en) * | 2024-08-15 | 2024-10-25 | 杭州贵禾科技有限公司 | Intelligent detection method and device for call quality and storage medium |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112818861A (en) * | 2021-02-02 | 2021-05-18 | 南京邮电大学 | Emotion classification method and system based on multi-mode context semantic features |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108197115B (en) * | 2018-01-26 | 2022-04-22 | 上海智臻智能网络科技股份有限公司 | Intelligent interaction method and device, computer equipment and computer readable storage medium |
CN111243579B (en) * | 2020-01-19 | 2022-10-14 | 清华大学 | Time domain single-channel multi-speaker voice recognition method and system |
CN111291534A (en) * | 2020-02-03 | 2020-06-16 | 苏州科技大学 | Global coding method for automatic summarization of Chinese long text |
CN113642674A (en) * | 2021-09-03 | 2021-11-12 | 贵州电网有限责任公司 | Multi-round dialogue classification method based on graph convolution neural network |
CN113936637A (en) * | 2021-10-18 | 2022-01-14 | 上海交通大学 | Voice self-adaptive completion system based on multi-mode knowledge graph |
CN114519809A (en) * | 2022-02-14 | 2022-05-20 | 复旦大学 | Audio-visual video analysis device and method based on multi-scale semantic network |
CN114863917A (en) * | 2022-04-02 | 2022-08-05 | 深圳市大梦龙途文化传播有限公司 | Game voice detection method, device, equipment and computer readable storage medium |
CN115048944B (en) * | 2022-08-16 | 2022-12-20 | 之江实验室 | Open domain dialogue reply method and system based on theme enhancement |
CN116341558A (en) * | 2022-12-06 | 2023-06-27 | 上海海事大学 | Multi-modal emotion recognition method and model based on multi-level graph neural network |
CN116153339A (en) * | 2022-12-06 | 2023-05-23 | 上海海事大学 | Speech emotion recognition method and device based on improved attention mechanism |
CN116959442B (en) * | 2023-07-29 | 2024-03-19 | 浙江阳宁科技有限公司 | Chip for intelligent switch panel and method thereof |
CN116895287A (en) * | 2023-08-04 | 2023-10-17 | 齐鲁工业大学(山东省科学院) | SHAP value-based depression voice phenotype analysis method |
- 2024-05-22 CN CN202410634545.1A patent/CN118230722B/en active Active
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112818861A (en) * | 2021-02-02 | 2021-05-18 | 南京邮电大学 | Emotion classification method and system based on multi-mode context semantic features |
Also Published As
Publication number | Publication date |
---|---|
CN118230722A (en) | 2024-06-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110491416B (en) | Telephone voice emotion analysis and identification method based on LSTM and SAE | |
CN108597496B (en) | Voice generation method and device based on generation type countermeasure network | |
CN112562691B (en) | Voiceprint recognition method, voiceprint recognition device, computer equipment and storage medium | |
CN118230722B (en) | Intelligent voice recognition method and system based on AI | |
CN108922543B (en) | Model base establishing method, voice recognition method, device, equipment and medium | |
CN113488058A (en) | Voiceprint recognition method based on short voice | |
CN111724770B (en) | Audio keyword identification method for generating confrontation network based on deep convolution | |
US5734793A (en) | System for recognizing spoken sounds from continuous speech and method of using same | |
CN111899757A (en) | Single-channel voice separation method and system for target speaker extraction | |
CN114550703A (en) | Training method and device of voice recognition system, and voice recognition method and device | |
Pardede et al. | Convolutional neural network and feature transformation for distant speech recognition | |
Zhang et al. | Temporal Transformer Networks for Acoustic Scene Classification. | |
Qi et al. | Exploiting low-rank tensor-train deep neural networks based on Riemannian gradient descent with illustrations of speech processing | |
KS et al. | Comparative performance analysis for speech digit recognition based on MFCC and vector quantization | |
Biagetti et al. | Speaker identification in noisy conditions using short sequences of speech frames | |
Nasrun et al. | Human emotion detection with speech recognition using Mel-frequency cepstral coefficient and support vector machine | |
Qais et al. | Deepfake audio detection with neural networks using audio features | |
Ong et al. | Speech emotion recognition with light gradient boosting decision trees machine | |
CN113488069A (en) | Method and device for quickly extracting high-dimensional voice features based on generative countermeasure network | |
Nijhawan et al. | Real time speaker recognition system for hindi words | |
Al-Thahab | Speech recognition based radon-discrete cosine transforms by Delta Neural Network learning rule | |
CN114512133A (en) | Sound object recognition method, sound object recognition device, server and storage medium | |
CN112951270A (en) | Voice fluency detection method and device and electronic equipment | |
Iswarya et al. | Speech query recognition for Tamil language using wavelet and wavelet packets | |
Kaewprateep et al. | Evaluation of small-scale deep learning architectures in Thai speech recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |