CN118230722B - Intelligent voice recognition method and system based on AI - Google Patents
Intelligent voice recognition method and system based on AI
- Publication number
- CN118230722B (application CN202410634545.1A)
- Authority
- CN
- China
- Prior art keywords
- waveform
- semantic
- sequence
- voice signal
- feature vectors
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 48
- 238000013527 convolutional neural network Methods 0.000 claims abstract description 25
- 238000000605 extraction Methods 0.000 claims abstract description 15
- 230000003044 adaptive effect Effects 0.000 claims abstract description 13
- 239000013598 vector Substances 0.000 claims description 206
- 230000003993 interaction Effects 0.000 claims description 39
- 239000012634 fragment Substances 0.000 claims description 37
- 230000004927 fusion Effects 0.000 claims description 28
- 239000011159 matrix material Substances 0.000 claims description 20
- 230000006870 function Effects 0.000 claims description 11
- 230000011218 segmentation Effects 0.000 claims description 7
- 230000009466 transformation Effects 0.000 claims description 7
- 230000003213 activating effect Effects 0.000 claims description 6
- 238000003062 neural network model Methods 0.000 claims description 6
- 230000004913 activation Effects 0.000 claims description 5
- 238000013507 mapping Methods 0.000 claims description 3
- 238000012545 processing Methods 0.000 abstract description 13
- 230000008569 process Effects 0.000 abstract description 11
- 230000007246 mechanism Effects 0.000 abstract description 7
- 239000000284 extract Substances 0.000 abstract description 6
- 238000013135 deep learning Methods 0.000 abstract description 4
- 238000011176 pooling Methods 0.000 description 6
- 238000010586 diagram Methods 0.000 description 4
- 230000000694 effects Effects 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 3
- 230000003595 spectral effect Effects 0.000 description 3
- 238000012549 training Methods 0.000 description 3
- 238000004364 calculation method Methods 0.000 description 2
- 238000006243 chemical reaction Methods 0.000 description 2
- 230000007613 environmental effect Effects 0.000 description 2
- 210000002569 neuron Anatomy 0.000 description 2
- 238000013179 statistical model Methods 0.000 description 2
- 238000013473 artificial intelligence Methods 0.000 description 1
- 238000013528 artificial neural network Methods 0.000 description 1
- 230000008901 benefit Effects 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 239000000306 component Substances 0.000 description 1
- 239000008358 core component Substances 0.000 description 1
- 238000013136 deep learning model Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 230000008451 emotion Effects 0.000 description 1
- 239000013604 expression vector Substances 0.000 description 1
- 230000004907 flux Effects 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 230000010354 integration Effects 0.000 description 1
- 230000002452 interceptive effect Effects 0.000 description 1
- 238000005065 mining Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000010606 normalization Methods 0.000 description 1
- 230000001737 promoting effect Effects 0.000 description 1
- 230000001902 propagating effect Effects 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 238000001228 spectrum Methods 0.000 description 1
- 238000000547 structure data Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Machine Translation (AREA)
Abstract
The application discloses an AI-based intelligent voice recognition method and system, which relate to the field of intelligent voice recognition. The method and system process and analyze the voice signal to be recognized by utilizing the strong feature extraction capability of deep learning network models such as the Convolutional Neural Network (CNN) and the graph convolutional neural network (GCN) and their capability to process structured data, extract local detail waveform features and global context semantic association features of the voice signal, and introduce an adaptive attention mechanism to integrate context information and highlight important feature distributions, thereby ensuring the accuracy of semantic recognition.
Description
Technical Field
The application relates to the field of intelligent voice recognition, and more particularly, to an AI-based intelligent voice recognition method and system.
Background
With the rapid development of artificial intelligence technology, intelligent voice recognition technology has become an important research direction in the field of human-computer interaction.
Traditional speech recognition systems rely on manual feature extraction and statistical models, and the methods often have limitations in processing complex speech signals, such as insufficient robustness to environmental noise, and poor adaptability to individual differences of speakers.
Therefore, an optimized intelligent speech recognition method and system are desired.
Disclosure of Invention
The present application has been made to solve the above-mentioned technical problems. The embodiment of the application provides an AI-based intelligent voice recognition method and system, which process and analyze the voice signal to be recognized by utilizing the strong feature extraction capability of deep learning network models such as Convolutional Neural Networks (CNNs) and graph convolutional neural networks (GCNs) and their capability to process structured data, extract local detail waveform features and global context semantic association features of the voice signal, and introduce an adaptive attention mechanism to integrate context information and highlight important feature distributions, thereby ensuring the accuracy of semantic recognition.
According to one aspect of the present application, there is provided an AI-based intelligent speech recognition method, including: acquiring a voice signal to be recognized; performing signal segmentation on the voice signal to be recognized to obtain a sequence of voice signal fragments; extracting global waveform semantic features of the sequence of the voice signal fragments to obtain a sequence of context voice signal fragment waveform semantic feature vectors; passing the sequence of the context voice signal segment waveform semantic feature vectors through a self-adaptive attention weight fusion network to obtain a global salient voice signal waveform feature vector; and determining a speech recognition result based on the globally significant speech signal waveform feature vector.
According to another aspect of the present application, there is provided an AI-based intelligent speech recognition system, including: a signal acquisition module, used for acquiring a voice signal to be recognized; a signal segmentation module, used for carrying out signal segmentation on the voice signal to be recognized so as to obtain a sequence of voice signal fragments; a global waveform semantic feature extraction module, used for extracting global waveform semantic features of the sequence of the voice signal fragments to obtain a sequence of context voice signal fragment waveform semantic feature vectors; a self-adaptive attention weight fusion module, used for enabling the sequence of the context voice signal segment waveform semantic feature vectors to pass through a self-adaptive attention weight fusion network to obtain a globally salient voice signal waveform feature vector; and a recognition result determining module, used for determining a voice recognition result based on the globally salient voice signal waveform feature vector.
Compared with the prior art, the AI-based intelligent voice recognition method and system provided by the application process and analyze the voice signal to be recognized by utilizing the strong feature extraction capability of deep learning network models such as the Convolutional Neural Network (CNN) and the graph convolutional neural network (GCN) and their capability to process structured data, extract local detail waveform features and global context semantic association features of the voice signal, and introduce an adaptive attention mechanism to integrate context information and highlight important feature distributions, thereby ensuring the accuracy of semantic recognition.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing embodiments of the present application in more detail with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification; they illustrate the application together with its embodiments and do not constitute a limitation of the application. In the drawings, like reference numerals generally refer to like parts or steps.
FIG. 1 is a flow chart of an AI-based intelligent speech recognition method in accordance with an embodiment of the application;
FIG. 2 is a method diagram of an AI-based intelligent speech recognition method in accordance with an embodiment of the application;
FIG. 3 is a flowchart of sub-step S3 of an AI-based intelligent speech recognition method in accordance with an embodiment of the application;
FIG. 4 is a flowchart of sub-step S4 of the AI-based intelligent speech recognition method in accordance with an embodiment of the application;
fig. 5 is a block diagram of an AI-based intelligent speech recognition system, in accordance with an embodiment of the present application.
Detailed Description
Hereinafter, exemplary embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.
As used in the specification and in the claims, the terms "a," "an," and "the" do not denote the singular but may include the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; they do not constitute an exclusive list, and a method or apparatus may also include other steps or elements.
Although the present application makes various references to certain modules in a system according to embodiments of the present application, any number of different modules may be used and run on a user terminal and/or server. The modules are merely illustrative, and different aspects of the systems and methods may use different modules.
A flowchart is used in the present application to describe the operations performed by a system according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed in order precisely. Rather, the various steps may be processed in reverse order or simultaneously, as desired. Also, other operations may be added to or removed from these processes.
In the field of intelligent speech recognition, it is a core task to accurately recognize speech signals and convert them into text information. However, the diversity and complexity of speech signals makes this task challenging. The speech signal is not only affected by factors such as speaker, speech speed, emotion, etc., but may also be disturbed by background noise. Traditional speech recognition systems rely on manual feature extraction and statistical models, and the methods often have limitations in processing complex speech signals, such as insufficient robustness to environmental noise, and poor adaptability to individual differences of speakers. Therefore, an optimized intelligent speech recognition method and system are desired.
In the technical scheme of the application, an intelligent voice recognition method based on AI is provided. Fig. 1 is a flowchart of an AI-based intelligent speech recognition method according to an embodiment of the present application. Fig. 2 is a method diagram of an AI-based intelligent speech recognition method according to an embodiment of the present application. As shown in fig. 1 and 2, the AI-based intelligent voice recognition method according to an embodiment of the present application includes the steps of: s1, acquiring a voice signal to be recognized; s2, carrying out signal segmentation on the voice signal to be recognized to obtain a sequence of voice signal fragments; s3, extracting global waveform semantic features of the sequence of the voice signal fragments to obtain a sequence of context voice signal fragment waveform semantic feature vectors; s4, passing the sequence of the context voice signal segment waveform semantic feature vectors through a self-adaptive attention weight fusion network to obtain global salient voice signal waveform feature vectors; and S5, determining a voice recognition result based on the globally significant voice signal waveform characteristic vector.
In particular, in S1, a speech signal to be recognized is acquired. The speech signal to be recognized serves as an important information source: it carries the language information of the speaker, including vocabulary, grammar, intonation and the like, and is the key and the basis for realizing speech-to-text conversion. In the embodiment of the application, voice data of a user can be acquired in real time through a microphone built into devices such as a smartphone, a notebook computer, or smart home equipment. In particular, in certain situations, such as recording podcasts or interviews, professional audio recording equipment may be used to obtain a high-quality speech signal.
In particular, the step S2 is to perform signal slicing on the speech signal to be recognized to obtain a sequence of speech signal segments. It is contemplated that in the practical application scenario of the present application the speech signal to be recognized is typically composed of a series of rapidly changing acoustic features. By slicing the speech signal to be recognized into shorter segments, the local characteristics of each segment, such as frequency, amplitude, etc., can be analyzed more efficiently. Meanwhile, the long signal is divided into a plurality of short segments, so that the data volume of single processing can be reduced, and the computational complexity and the resource consumption are reduced. More importantly, in noisy environments, the impact of noise on the overall speech recognition process can be reduced by signal slicing, mainly because ambient noise may affect only some segments, but not all.
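As an illustrative sketch (not a prescription from the application), the following Python snippet shows one way the signal slicing of step S2 could be realized with fixed-length overlapping segments; the 25 ms frame length and 10 ms hop are assumed values chosen only for the example.

```python
import numpy as np

def slice_speech_signal(signal: np.ndarray, sample_rate: int,
                        frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Cut a 1-D speech waveform into overlapping fixed-length segments.

    frame_ms / hop_ms are illustrative assumptions; the application does not
    fix a particular segment length.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    segments = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        segments.append(signal[start:start + frame_len])
    return np.stack(segments) if segments else np.empty((0, frame_len))

# usage: 1 s of 16 kHz audio -> sequence of 25 ms segments with a 10 ms hop
speech = np.random.randn(16000)
segment_sequence = slice_speech_signal(speech, 16000)
print(segment_sequence.shape)  # (num_segments, 400)
```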
Specifically, the step S3 is to perform global waveform semantic feature extraction on the sequence of the speech signal segments to obtain a sequence of waveform semantic feature vectors of the contextual speech signal segments. In particular, in one specific example of the present application, as shown in fig. 3, the S3 includes: s31, respectively passing each voice signal segment in the sequence of voice signal segments through a signal waveform feature extractor based on a convolutional neural network model to obtain a sequence of voice signal segment waveform semantic feature vectors; s32, calculating the hash similarity between any two voice signal segment waveform semantic feature vectors in the sequence of the voice signal segment waveform semantic feature vectors to obtain a segment waveform consistency topology matrix; s33, passing the sequence of the voice signal segment waveform semantic feature vectors and the segment waveform consistency topological matrix through a global context semantic encoder based on a graph convolution neural network model to obtain the sequence of the context voice signal segment waveform semantic feature vectors.
Specifically, in S31, each speech signal segment in the sequence of speech signal segments is passed through a signal waveform feature extractor based on a convolutional neural network model to obtain a sequence of speech signal segment waveform semantic feature vectors. The Convolutional Neural Network (CNN) model convolves each speech signal segment with learnable kernels so that its local receptive fields capture local patterns in the speech signal, such as the time-series fluctuation patterns of phonemes and tones. It is worth mentioning that a Convolutional Neural Network (CNN) is a deep learning model specially suited to data with a grid structure, such as images and speech. The core idea of a CNN is to extract features from the input data by convolution operations and to build higher-level representations and abstractions of those features by stacking layers. The basic components and working principles of a CNN are as follows: Convolutional layer: the convolutional layer is the core component of the CNN and is used to extract features from the input data. It performs convolution operations on the input by applying a set of learnable convolution kernels (filters); these operations capture local patterns and features in the input data and generate a series of feature maps. Activation function: after a convolutional layer, a nonlinear activation function such as ReLU is typically applied; it introduces nonlinearity so that the network can learn more complex patterns and representations. Pooling layer: the pooling layer reduces the size of the feature maps and the number of parameters while retaining the most important features; common pooling operations include max pooling and average pooling. Fully connected layer: after a series of convolution and pooling layers, several fully connected layers are typically added to convert the feature maps of the previous layer into the output result, such as a classification or regression. Dropout: to prevent overfitting, the Dropout technique is often used in CNNs; it randomly discards a portion of the neurons during training, reducing dependencies among neurons and improving the generalization capability of the model. Through the back-propagation algorithm, the CNN can automatically learn and extract the features in the input data and optimize them according to the training target. During training, the CNN adjusts the network parameters by minimizing the loss function so that the output results are as close as possible to the true labels.
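The following is a minimal, hedged PyTorch sketch of a CNN-based signal waveform feature extractor of the kind described above; the layer sizes, kernel widths, and the 128-dimensional output are illustrative assumptions rather than parameters specified by the application.

```python
import torch
import torch.nn as nn

class SignalWaveformFeatureExtractor(nn.Module):
    """Minimal 1-D CNN mapping a raw speech segment to a waveform semantic
    feature vector (all layer sizes are illustrative assumptions)."""
    def __init__(self, feature_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, padding=4),   # capture local waveform patterns
            nn.ReLU(),
            nn.MaxPool1d(4),                               # pooling keeps the most salient responses
            nn.Conv1d(16, 32, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                       # collapse the time axis
        )
        self.dropout = nn.Dropout(0.1)                     # regularization against overfitting
        self.fc = nn.Linear(32, feature_dim)               # fully connected projection

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (batch, segment_length) raw waveform samples
        x = self.conv(segment.unsqueeze(1)).squeeze(-1)    # (batch, 32)
        return self.fc(self.dropout(x))                    # (batch, feature_dim)

# usage: a batch of 400-sample segments -> 128-d waveform semantic feature vectors
feats = SignalWaveformFeatureExtractor()(torch.randn(8, 400))
print(feats.shape)  # torch.Size([8, 128])
```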
Specifically, in S32, the hash similarity between any two speech signal segment waveform semantic feature vectors in the sequence of speech signal segment waveform semantic feature vectors is calculated to obtain a segment waveform consistency topology matrix. Hash similarity can be used to measure the similarity between two data objects. In the processing of the voice signal, by calculating the hash similarity between the waveform semantic feature vectors of two voice signal segments, the distance or similarity between the hash codes corresponding to those feature vectors can be compared and measured, so that the segment waveform consistency topology matrix is quickly constructed, which facilitates analysis of the overall structural pattern and the inherent implicit associations of the voice signal. In particular, since hash similarity focuses more on the overall pattern than on specific details, its calculation is robust to noise to a certain degree, which makes it easier to express and describe the overall association pattern among different voice segments. In a specific example of the present application, the encoding process for calculating the hash similarity between any two speech signal segment waveform semantic feature vectors in the sequence of speech signal segment waveform semantic feature vectors to obtain the segment waveform consistency topology matrix includes: firstly, calculating the voice signal segment waveform semantic hash coding feature vector of each voice signal segment waveform semantic feature vector in the sequence of voice signal segment waveform semantic feature vectors by using a hash mapping function, so as to obtain a sequence of voice signal segment waveform semantic hash coding feature vectors; then, calculating the cosine similarity between every two voice signal segment waveform semantic hash coding feature vectors in the sequence of voice signal segment waveform semantic hash coding feature vectors to obtain a sequence of hash similarities; and then, arranging the sequence of hash similarities two-dimensionally to obtain the segment waveform consistency topology matrix.
More specifically, calculating the cosine similarity between every two voice signal segment waveform semantic hash coding feature vectors in the sequence of voice signal segment waveform semantic hash coding feature vectors to obtain the sequence of hash similarities comprises: firstly, computing the position-wise element-by-element product of the former voice signal segment waveform semantic hash coding feature vector and the latter voice signal segment waveform semantic hash coding feature vector in each pair, and summing the products to obtain an element-by-element association projection fusion value between the signal segments; then, computing the Euclidean norms of the former and the latter voice signal segment waveform semantic hash coding feature vectors respectively, and multiplying the two Euclidean norms to obtain a semantic interaction fusion value between the signal segments; and finally, dividing the element-by-element association projection fusion value between the signal segments by the semantic interaction fusion value between the signal segments to obtain the hash similarity.
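A hedged sketch of step S32 follows: it assumes sign-of-random-projection hashing as the hash mapping function (the application does not fix a particular hash) and computes the pairwise cosine similarity of the hash codes exactly as described, arranging the results into the segment waveform consistency topology matrix.

```python
import numpy as np

def hash_encode(features: np.ndarray, code_dim: int = 64, seed: int = 0) -> np.ndarray:
    """Sign-of-random-projection hashing (an assumed choice of hash mapping
    function) applied to each segment feature vector."""
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((features.shape[1], code_dim))
    return np.sign(features @ projection)          # (num_segments, code_dim), entries in {-1, 0, +1}

def consistency_topology_matrix(features: np.ndarray) -> np.ndarray:
    """Cosine similarity between every pair of hash codes, arranged two-
    dimensionally as the segment waveform consistency topology matrix."""
    codes = hash_encode(features)
    dot = codes @ codes.T                          # position-wise products summed for each pair
    norms = np.linalg.norm(codes, axis=1, keepdims=True)
    return dot / (norms @ norms.T)                 # divide by the product of Euclidean norms

# usage: 8 segments with 128-d feature vectors -> 8 x 8 topology matrix
topology = consistency_topology_matrix(np.random.randn(8, 128))
print(topology.shape)  # (8, 8)
```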
Specifically, in S33, the sequence of speech signal segment waveform semantic feature vectors and the segment waveform consistency topology matrix are passed through a global context semantic encoder based on a graph convolutional neural network model to obtain the sequence of contextual speech signal segment waveform semantic feature vectors. Those of ordinary skill in the art will appreciate that the graph convolutional neural network (GCN) model is particularly adept at processing graph-structured data and can capture complex relationships between nodes. In speech recognition, this may be used to model global context relationships between speech signal segments, thereby strengthening the associations between them. More specifically, the graph convolutional neural network model achieves deep fusion of features by simultaneously processing the local features of the speech signal (the sequence of speech signal segment waveform semantic feature vectors) and the global structure information (the segment waveform consistency topology matrix), and learns long-distance dependency relationships existing in the speech signal; for example, the pronunciation of a word may be affected by a preceding or following word. This can enhance the semantic understanding capabilities of the model.
Accordingly, in one possible implementation, the sequence of speech signal segment waveform semantic feature vectors and the segment waveform consistency topology matrix may be passed through a global context semantic encoder based on a graph-convolution neural network model to obtain the sequence of contextual speech signal segment waveform semantic feature vectors, for example: inputting the sequence of the voice signal fragment waveform semantic feature vector and the fragment waveform consistency topology matrix; taking a sequence of waveform semantic feature vectors of the voice signal fragments as node features, wherein each node represents a feature vector of one voice signal fragment; defining connection relations among nodes by using a segment waveform consistency topology matrix, wherein the topology matrix represents similarity or association among the nodes; defining a graph convolution neural network model for learning a representation of node features on a graph structure; constructing a global context semantic encoder for learning global context information and encoding semantic features; in the global context semantic encoder, feature propagation and integration will take place over the entire graph structure, propagating and integrating node features through multiple rounds of GCN operations; after processing by the global context semantic encoder, each node (i.e., the feature vector of each speech signal segment) obtains an updated representation to obtain a sequence of waveform semantic feature vectors of the context speech signal segment.
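The snippet below sketches such a global context semantic encoder as a two-layer graph convolution with symmetric adjacency normalization; the depth, dimensions, and normalization choice are assumptions for illustration, not the exact encoder claimed by the application.

```python
import torch
import torch.nn as nn

class GlobalContextSemanticEncoder(nn.Module):
    """Two rounds of graph convolution over the segment feature sequence, using
    the consistency topology matrix as a weighted adjacency; a sketch only."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.w1 = nn.Linear(dim, dim)
        self.w2 = nn.Linear(dim, dim)

    @staticmethod
    def _normalize(adj: torch.Tensor) -> torch.Tensor:
        # symmetric normalization D^{-1/2} (A + I) D^{-1/2}
        a = adj + torch.eye(adj.size(0))
        d = a.sum(dim=1).clamp(min=1e-6).pow(-0.5)
        return d.unsqueeze(1) * a * d.unsqueeze(0)

    def forward(self, node_feats: torch.Tensor, topology: torch.Tensor) -> torch.Tensor:
        # node_feats: (num_segments, dim); topology: (num_segments, num_segments)
        a_hat = self._normalize(topology)
        h = torch.relu(self.w1(a_hat @ node_feats))   # first round of propagation and integration
        return self.w2(a_hat @ h)                     # contextual segment waveform semantic feature vectors

# usage
encoder = GlobalContextSemanticEncoder()
contextual = encoder(torch.randn(8, 128), torch.rand(8, 8))
print(contextual.shape)  # torch.Size([8, 128])
```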
It should be noted that, in other specific examples of the present application, the global waveform semantic feature extraction may be performed on the sequence of speech signal segments in other manners to obtain the sequence of contextual speech signal segment waveform semantic feature vectors, for example: inputting the sequence of speech signal segments; windowing each speech signal segment and converting the time-domain signal into a frequency-domain representation, typically using a short-time Fourier transform or another time-frequency conversion method; extracting global waveform semantic features from the frequency-domain representation, which may include spectral features, the spectral envelope, the spectral centroid, the spectral flux, and the like, all of which can capture various semantic information of the speech signal; considering the temporal correlation of speech signals, introducing context information on the basis of the global waveform semantic feature extraction; and combining the global waveform semantic feature extraction result of each speech signal segment into a feature vector to obtain the sequence of contextual speech signal segment waveform semantic feature vectors.
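As a hedged illustration of this alternative, the following snippet computes a few of the listed frequency-domain descriptors (spectral centroid, spectral flux, log energy) for a single segment using a short-time Fourier transform; the window length and the particular subset of features are assumptions made for the example.

```python
import numpy as np
from scipy.signal import stft

def spectral_segment_features(segment: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    """Frequency-domain summary of one segment: spectral centroid, spectral
    flux and log energy (an illustrative subset of the features listed above)."""
    freqs, _, spec = stft(segment, fs=sample_rate, nperseg=128)
    mag = np.abs(spec)                                   # (freq_bins, frames)
    centroid = (freqs[:, None] * mag).sum(0) / (mag.sum(0) + 1e-9)
    flux = np.sqrt((np.diff(mag, axis=1) ** 2).sum(0))  # frame-to-frame spectral change
    log_energy = np.log((mag ** 2).sum(0) + 1e-9)
    return np.array([centroid.mean(),
                     flux.mean() if flux.size else 0.0,
                     log_energy.mean()])

# usage: one 400-sample segment -> 3-d spectral feature vector
print(spectral_segment_features(np.random.randn(400)))
```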
In particular, the step S4 is to pass the sequence of the contextual speech signal segment waveform semantic feature vectors through an adaptive attention weight fusion network to obtain a globally salient speech signal waveform feature vector. The self-adaptive attention weight fusion network is essentially a weight probability distribution mechanism, namely, important contents are assigned with larger weights, and other contents are reduced in weight. Such a mechanism is more focused on finding useful information in the input data that is significantly related to the current data, mining the autocorrelation between waveform semantic feature vectors of individual contextual speech signal segments. The self-adaptive attention weight fusion network can learn semantic dependency and context correlation between voice signal segment data in different local time spans in sequence data of the context voice signal segment waveform semantic feature vector, and can more comprehensively understand dynamic features and semantic change modes in the voice signal segment data by carrying out autocorrelation modeling on the whole sequence. In particular, in one specific example of the present application, as shown in fig. 4, the S4 includes: s41, calculating waveform autocorrelation attention weights of waveform semantic feature vectors of all the contextual speech signal segments in the sequence of waveform semantic feature vectors of the contextual speech signal segments to obtain a sequence of waveform autocorrelation attention weights; s42, normalizing the sequence of waveform autocorrelation attention weights to obtain the sequence of waveform autocorrelation attention weight coefficients; and S43, calculating a vector-by-vector weighted sum of the sequence of the context speech signal segment waveform semantic feature vectors with the sequence of waveform autocorrelation attention weight coefficients as weights to obtain the globally salient speech signal waveform feature vectors.
Specifically, the step S41 calculates waveform autocorrelation attention weights of waveform semantic feature vectors of each of the contextual speech signal segment waveform semantic feature vectors in the sequence of contextual speech signal segment waveform semantic feature vectors to obtain a sequence of waveform autocorrelation attention weights. In a specific example, first, determining a weight matrix of each contextual speech signal segment waveform semantic feature vector in the sequence of contextual speech signal segment waveform semantic feature vectors to obtain a set of weight matrices; based on the set of weight matrices, respectively calculating matrix products between each group of corresponding weight matrices and the context voice signal segment waveform semantic feature vectors to obtain a sequence of weighted context voice signal segment waveform semantic feature vectors; then, adding each weighted context voice signal segment waveform semantic feature vector in the sequence of weighted context voice signal segment waveform semantic feature vectors with a bias vector to obtain a sequence of biased context voice signal segment waveform semantic feature vectors; activating the sequence of waveform semantic feature vectors of the biased context voice signal fragments to obtain the sequence of waveform semantic feature vectors of the nonlinear transformation context voice signal fragments; and finally, calculating the product between each nonlinear transformation context voice signal segment waveform semantic feature vector and the transpose vector of the preset reference feature vector in the sequence of nonlinear transformation context voice signal segment waveform semantic feature vectors to obtain the sequence of waveform autocorrelation attention weights. Activating the sequence of the waveform semantic feature vectors of the biased contextual speech signal segment to obtain the sequence of the waveform semantic feature vectors of the nonlinear transformation contextual speech signal segment, wherein the method comprises the following steps: activating the sequence of biased contextual speech signal segment waveform semantic feature vectors using Selu activation functions to obtain the sequence of nonlinear transformed contextual speech signal segment waveform semantic feature vectors.
Specifically, the step S42 normalizes the sequence of waveform autocorrelation attention weights to obtain a sequence of waveform autocorrelation attention weight coefficients. It should be appreciated that normalization may eliminate the dimensional effects between different data, ensuring that the values of the individual waveform autocorrelation attention weights are within a similar range.
Specifically, the step S43 calculates a vector-wise weighted sum of the sequence of the waveform semantic feature vectors of the contextual speech signal segment with the sequence of the waveform autocorrelation attention weight coefficients as weights to obtain the globally salient speech signal waveform feature vector. It should be appreciated that the vector-wise weighted sum operation may sum feature vectors of individual speech signal segments by importance weights to obtain globally pronounced speech signal waveform feature vectors. This allows the information in the whole sequence to be integrated into one vector, better characterizing the whole speech signal sequence.
In summary, in the above embodiment, passing the sequence of the contextual speech signal segment waveform semantic feature vectors through an adaptive attention weight fusion network to obtain a globally significant speech signal waveform feature vector includes: using an adaptive attention weight fusion network to enable the sequence of the context voice signal segment waveform semantic feature vectors to pass through the adaptive attention weight fusion network to obtain global salient voice signal waveform feature vectors according to the following formula; wherein, the formula is:
$$a_k = q^{\top}\,\mathrm{Selu}\!\left(W x_k + b\right)$$

$$\alpha_k = \frac{\exp\!\left(a_k\right)}{\sum_{j=1}^{N}\exp\!\left(a_j\right)}$$

$$v = \sum_{k=1}^{N}\alpha_k\, x_k$$

wherein $x_k$ is the $k$-th contextual speech signal segment waveform semantic feature vector in the sequence of contextual speech signal segment waveform semantic feature vectors, $q^{\top}$ and $W$ respectively represent the transpose vector of the preset reference feature vector and the weight matrix, $b$ is the bias vector, $\mathrm{Selu}(\cdot)$ is the activation function, $a_k$ is the $k$-th waveform autocorrelation attention weight, $\alpha_k$ is the corresponding waveform autocorrelation attention weight coefficient, $e$ is the natural constant and $\exp$ denotes the exponential operation, $N$ is the number of feature vectors in the sequence of contextual speech signal segment waveform semantic feature vectors, and $v$ is the globally salient speech signal waveform feature vector.
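A minimal PyTorch sketch of the three steps formalized above (scoring, normalization, weighted sum) is given below; the feature dimension and the random initialization of the learnable reference vector are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveAttentionWeightFusion(nn.Module):
    """Scores each contextual segment vector with W, b, Selu and a reference
    vector q, normalizes the scores, and forms the weighted sum."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.W = nn.Linear(dim, dim)              # weight matrix and bias vector
        self.q = nn.Parameter(torch.randn(dim))   # preset reference feature vector (assumed learnable)

    def forward(self, ctx_feats: torch.Tensor) -> torch.Tensor:
        # ctx_feats: (num_segments, dim) contextual segment waveform semantic feature vectors
        scores = torch.selu(self.W(ctx_feats)) @ self.q      # waveform autocorrelation attention weights a_k
        coeffs = torch.softmax(scores, dim=0)                # normalized coefficients alpha_k
        return (coeffs.unsqueeze(1) * ctx_feats).sum(dim=0)  # globally salient waveform feature vector v

# usage
fused = AdaptiveAttentionWeightFusion()(torch.randn(8, 128))
print(fused.shape)  # torch.Size([128])
```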
It should be noted that, in other specific examples of the present application, the sequence of the contextual speech signal segment waveform semantic feature vectors may also be passed through an adaptive attention weight fusion network to obtain a globally significant speech signal waveform feature vector in other manners, for example: inputting a sequence of waveform semantic feature vectors of the contextual speech signal segment; introducing an adaptive attention mechanism for learning the importance weight of each contextual speech signal segment waveform semantic feature vector; by calculating the attention weight of each vector, the network can automatically learn the attention degree of different voice signal fragments so as to better fuse information; according to the calculated attention weight, carrying out weighted fusion on the context voice signal segment waveform semantic feature vector so as to obtain a globally salient voice signal waveform feature vector; and carrying out weighted fusion on the waveform semantic feature vector of each context voice signal segment according to the attention weight so as to obtain the global salient voice signal waveform feature vector.
In particular, in S5, a speech recognition result is determined based on the globally salient speech signal waveform feature vector. In one specific example of the present application, the globally salient speech signal waveform feature vector is passed through a decoder-based speech recognizer to obtain the speech recognition result. Preferably, passing the globally salient speech signal waveform feature vector through a decoder-based speech recognizer to obtain the speech recognition result specifically includes: dividing each feature value of the globally salient speech signal waveform feature vector by the largest feature value of the globally salient speech signal waveform feature vector to obtain a globally salient speech signal waveform semantic interaction representation vector; dividing the mean of the feature values of the globally salient speech signal waveform feature vector by the standard deviation of those feature values to obtain a statistical dimension interaction value corresponding to the globally salient speech signal waveform feature vector; subtracting the statistical dimension interaction value from each feature value of the globally salient speech signal waveform semantic interaction representation vector, and taking the logarithm at each position, to obtain a globally salient speech signal waveform semantic interaction information representation vector; adding the statistical dimension interaction value to each feature value of the globally salient speech signal waveform semantic interaction representation vector and multiplying the sum by a preset weight hyperparameter to obtain a globally salient speech signal waveform semantic interaction pattern representation vector; combining the globally salient speech signal waveform semantic interaction information representation vector and the globally salient speech signal waveform semantic interaction pattern representation vector point-wise to obtain an optimized globally salient speech signal waveform feature vector; and passing the optimized globally salient speech signal waveform feature vector through the decoder-based speech recognizer to obtain the speech recognition result.
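The following sketch implements the statistical rescaling described above for optimizing the globally salient feature vector; the preset weight hyperparameter value, the point-wise product used as the final combination, and the clamp guarding the logarithm are assumptions made for this illustration (the decoder-based recognizer itself is not shown).

```python
import torch

def optimize_global_feature(v: torch.Tensor, weight: float = 0.5) -> torch.Tensor:
    """Statistical-interaction rescaling of the globally salient feature vector.

    `weight` is the preset weight hyperparameter (its value is an assumption);
    the clamp below is a numerical safeguard added for this sketch only.
    """
    interaction = v / v.max()                                  # semantic interaction representation
    stat = v.mean() / (v.std() + 1e-9)                         # statistical dimension interaction value
    info = torch.log((interaction - stat).clamp(min=1e-9))    # interaction information representation
    pattern = (interaction + stat) * weight                    # interaction pattern representation
    return info * pattern                                      # point-wise combination -> optimized vector

# usage: the optimized vector would then be fed to the decoder-based recognizer
optimized = optimize_global_feature(torch.rand(128) + 0.5)
print(optimized.shape)  # torch.Size([128])
```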
In order to improve the feature expression of the globally salient speech signal waveform feature vector in the complex feature-expression dimension, the above steps reconstruct its complex manifold structure, in the form of a global structural latent manifold dictionary, based on the feature information pattern and the feature distribution pattern of the semantic interaction pattern representation vector. This improves the model's ability, during iteration, to generate and understand the manifold structure corresponding to the feature under the complex feature-expression dimension, and thereby improves the speech recognition effect of the decoder-based speech recognizer that uses the globally salient speech signal waveform feature vector.
In summary, the AI-based intelligent speech recognition method according to an embodiment of the present application has been explained. It processes and analyzes the speech signal to be recognized by utilizing the strong feature extraction capability of deep learning network models such as the Convolutional Neural Network (CNN) and the graph convolutional neural network (GCN) and their capability to process structured data, extracts local detail waveform features and global context semantic association features of the speech signal, and introduces an adaptive attention mechanism to integrate context information and highlight important feature distributions, thereby ensuring the accuracy of semantic recognition.
Further, an AI-based intelligent speech recognition system is also provided.
Fig. 5 is a block diagram of an AI-based intelligent speech recognition system, in accordance with an embodiment of the present application. As shown in fig. 5, the AI-based intelligent speech recognition system 300, according to an embodiment of the present application, includes: a signal acquisition module 310, configured to acquire a voice signal to be recognized; the signal slicing module 320 is configured to perform signal slicing on the speech signal to be identified to obtain a sequence of speech signal segments; the global waveform semantic feature extraction module 330 is configured to perform global waveform semantic feature extraction on the sequence of speech signal segments to obtain a sequence of context speech signal segment waveform semantic feature vectors; the adaptive attention weight fusion module 340 is configured to pass the sequence of the context speech signal segment waveform semantic feature vectors through an adaptive attention weight fusion network to obtain a global salient speech signal waveform feature vector; and a recognition result determining module 350 for determining a speech recognition result based on the globally significant speech signal waveform feature vector.
As described above, the AI-based intelligent speech recognition system 300 according to an embodiment of the present application can be implemented in various wireless terminals, such as a server or the like having an AI-based intelligent speech recognition algorithm. In one possible implementation, the AI-based intelligent speech recognition system 300 in accordance with an embodiment of the application can be integrated into a wireless terminal as a software module and/or hardware module. For example, the AI-based intelligent speech recognition system 300 can be a software module in the operating system of the wireless terminal or can be an application developed for the wireless terminal; of course, the AI-based intelligent speech recognition system 300 could equally be one of a number of hardware modules of the wireless terminal.
In another example, the AI-based intelligent speech recognition system 300 can connect to the wireless terminal via a wired and/or wireless network and transmit the interactive information in an agreed-upon data format.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (8)
1. The intelligent voice recognition method based on the AI is characterized by comprising the following steps:
Acquiring a voice signal to be recognized;
performing signal segmentation on the voice signal to be recognized to obtain a sequence of voice signal fragments;
Extracting global waveform semantic features of the sequence of the voice signal fragments to obtain a sequence of context voice signal fragment waveform semantic feature vectors;
passing the sequence of the context voice signal segment waveform semantic feature vectors through a self-adaptive attention weight fusion network to obtain a global salient voice signal waveform feature vector; and
Determining a speech recognition result based on the global-saliency speech signal waveform feature vectors;
determining a speech recognition result based on the globally significant speech signal waveform feature vector, comprising:
passing the global-saliency speech signal waveform feature vectors through a decoder-based speech recognizer to obtain the speech recognition result;
the step of passing the global significant voice signal waveform feature vector through a decoder-based voice recognizer to obtain a voice recognition result specifically comprises the following steps: dividing each eigenvalue of the globally significant speech signal waveform eigenvector by the largest eigenvalue of the globally significant speech signal waveform eigenvector to obtain a globally significant speech signal waveform semantic interaction representation vector; dividing the mean value of the eigenvalues of the waveform eigenvectors of the global salified voice signal by the standard deviation of the eigenvalues of the waveform eigenvectors of the global salified voice signal to obtain a statistical dimension interaction value corresponding to the waveform eigenvector of the global salified voice signal; subtracting the statistical dimension interaction value from each characteristic value of the global saliency speech signal waveform semantic interaction representation vector, and calculating the logarithmic value of each position of the global saliency speech signal waveform semantic interaction representation vector to obtain a global saliency speech signal waveform semantic interaction information representation vector; adding each characteristic value of the global saliency speech signal waveform semantic interaction representation vector to the statistical dimension interaction value, and multiplying the sum by a preset weight super parameter to obtain a global saliency speech signal waveform semantic interaction mode representation vector; obtaining an optimized global saliency speech signal waveform feature vector by combining the global saliency speech signal waveform semantic interaction information representation vector and the global saliency speech signal waveform semantic interaction mode representation vector point; and passing the optimized globally significant speech signal waveform feature vector through a decoder-based speech recognizer to obtain a speech recognition result.
2. The AI-based intelligent speech recognition method of claim 1, wherein performing global waveform semantic feature extraction on the sequence of speech signal segments to obtain a sequence of contextual speech signal segment waveform semantic feature vectors comprises:
Each voice signal segment in the sequence of voice signal segments respectively passes through a signal waveform feature extractor based on a convolutional neural network model to obtain a sequence of voice signal segment waveform semantic feature vectors;
Calculating the hash similarity between any two voice signal segment waveform semantic feature vectors in the sequence of the voice signal segment waveform semantic feature vectors to obtain a segment waveform consistency topology matrix;
And passing the sequence of the voice signal segment waveform semantic feature vectors and the segment waveform consistency topology matrix through a global context semantic encoder based on a graph convolution neural network model to obtain the sequence of the context voice signal segment waveform semantic feature vectors.
3. The AI-based intelligent speech recognition method of claim 2, wherein calculating the hash similarity between any two speech signal segment waveform semantic feature vectors in the sequence of speech signal segment waveform semantic feature vectors to obtain a segment waveform consistency topology matrix comprises:
Calculating the voice signal segment waveform semantic hash coding feature vector of each voice signal segment waveform semantic feature vector in the sequence of the voice signal segment waveform semantic feature vector by using a hash mapping function so as to obtain the sequence of the voice signal segment waveform semantic hash coding feature vector;
calculating cosine similarity between every two voice signal segment waveform semantic hash coding feature vectors in the sequence of the voice signal segment waveform semantic hash coding feature vectors to obtain a sequence of hash similarity; and
And arranging the sequence of hash similarities two-dimensionally to obtain the segment waveform consistency topology matrix.
4. The AI-based intelligent speech recognition method of claim 3, wherein calculating cosine similarity between each two of the speech signal segment waveform semantic hash coding feature vectors in the sequence of speech signal segment waveform semantic hash coding feature vectors to obtain the sequence of hash similarities comprises:
Calculating the sum of the element-by-element point multiplication according to the position of the previous voice signal segment waveform semantic hash coding feature vector and the next voice signal segment waveform semantic hash coding feature vector in every two voice signal segment waveform semantic hash coding feature vectors to obtain an element-by-element association projection fusion value between signal segments;
Respectively calculating Euclidean norms of the waveform semantic hash coding feature vector of the previous voice signal segment and the waveform semantic hash coding feature vector of the next voice signal segment, and multiplying the obtained two Euclidean norms to obtain a semantic interaction fusion value between the signal segments; and
Dividing the element-by-element association projection fusion value among the signal fragments by the semantic interaction fusion value among the signal fragments to obtain the hash similarity.
5. The AI-based intelligent speech recognition method of claim 4, wherein passing the sequence of contextual speech signal segment waveform semantic feature vectors through an adaptive attention weight fusion network to obtain a globally salient speech signal waveform feature vector comprises:
Calculating waveform autocorrelation attention weights of waveform semantic feature vectors of all the contextual speech signal segments in the sequence of waveform semantic feature vectors of the contextual speech signal segments to obtain a sequence of waveform autocorrelation attention weights;
Normalizing the sequence of waveform autocorrelation attention weights to obtain a sequence of waveform autocorrelation attention weight coefficients; and
And calculating a vector-by-vector weighted sum of the sequence of contextual speech signal segment waveform semantic feature vectors with the sequence of waveform autocorrelation attention weight coefficients as weights to obtain the globally significant speech signal waveform feature vector.
6. The AI-based intelligent speech recognition method of claim 5, wherein calculating waveform autocorrelation attention weights for each of the sequence of contextual speech signal segment waveform semantic feature vectors to obtain a sequence of waveform autocorrelation attention weights comprises:
determining weight matrixes of the waveform semantic feature vectors of the contextual speech signal segments in the sequence of the waveform semantic feature vectors of the contextual speech signal segments to obtain a set of weight matrixes;
Based on the set of weight matrices, respectively calculating matrix products between each group of corresponding weight matrices and the context voice signal segment waveform semantic feature vectors to obtain a sequence of weighted context voice signal segment waveform semantic feature vectors;
adding each weighted context voice signal segment waveform semantic feature vector in the sequence of weighted context voice signal segment waveform semantic feature vectors with a bias vector to obtain a sequence of biased context voice signal segment waveform semantic feature vectors;
activating the sequence of waveform semantic feature vectors of the biased context voice signal fragments to obtain the sequence of waveform semantic feature vectors of the nonlinear transformation context voice signal fragments; and
And calculating the product between each nonlinear transformation context voice signal segment waveform semantic feature vector and the transpose vector of the preset reference feature vector in the sequence of nonlinear transformation context voice signal segment waveform semantic feature vectors to obtain the sequence of waveform autocorrelation attention weights.
7. The AI-based intelligent speech recognition method of claim 6, wherein activating the sequence of biased contextual speech signal segment waveform semantic feature vectors to obtain a sequence of nonlinear transformed contextual speech signal segment waveform semantic feature vectors comprises:
Activating the sequence of biased contextual speech signal segment waveform semantic feature vectors using Selu activation functions to obtain the sequence of nonlinear transformed contextual speech signal segment waveform semantic feature vectors.
8. An AI-based intelligent speech recognition system, comprising:
the signal acquisition module is used for acquiring a voice signal to be recognized;
the signal segmentation module is used for carrying out signal segmentation on the voice signal to be recognized so as to obtain a sequence of voice signal fragments;
the global waveform semantic feature extraction module is used for extracting global waveform semantic features of the sequence of the voice signal fragments to obtain a sequence of context voice signal fragment waveform semantic feature vectors;
the self-adaptive attention weight fusion module is used for enabling the sequence of the context voice signal segment waveform semantic feature vectors to pass through a self-adaptive attention weight fusion network to obtain global salient voice signal waveform feature vectors;
the system further includes a recognition result determination module for determining a speech recognition result based on the globally significant speech signal waveform feature vector;
The identification result determining module is specifically configured to:
passing the global-saliency speech signal waveform feature vectors through a decoder-based speech recognizer to obtain the speech recognition result;
The identification result determining module is specifically configured to: divide each feature value of the globally salient speech signal waveform feature vector by the largest feature value of the globally salient speech signal waveform feature vector to obtain a globally salient speech signal waveform semantic interaction representation vector; divide the mean of the feature values of the globally salient speech signal waveform feature vector by the standard deviation of those feature values to obtain a statistical dimension interaction value corresponding to the globally salient speech signal waveform feature vector; subtract the statistical dimension interaction value from each feature value of the globally salient speech signal waveform semantic interaction representation vector, and take the logarithm at each position, to obtain a globally salient speech signal waveform semantic interaction information representation vector; add the statistical dimension interaction value to each feature value of the globally salient speech signal waveform semantic interaction representation vector and multiply the sum by a preset weight hyperparameter to obtain a globally salient speech signal waveform semantic interaction pattern representation vector; combine the globally salient speech signal waveform semantic interaction information representation vector and the globally salient speech signal waveform semantic interaction pattern representation vector point-wise to obtain an optimized globally salient speech signal waveform feature vector; and pass the optimized globally salient speech signal waveform feature vector through the decoder-based speech recognizer to obtain the speech recognition result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410634545.1A CN118230722B (en) | 2024-05-22 | 2024-05-22 | Intelligent voice recognition method and system based on AI |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410634545.1A CN118230722B (en) | 2024-05-22 | 2024-05-22 | Intelligent voice recognition method and system based on AI |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118230722A (en) | 2024-06-21
CN118230722B (en) | 2024-08-13
Family
ID=91501214
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410634545.1A Active CN118230722B (en) | 2024-05-22 | 2024-05-22 | Intelligent voice recognition method and system based on AI |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118230722B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118646823B (en) * | 2024-08-15 | 2024-10-25 | 杭州贵禾科技有限公司 | Intelligent detection method and device for call quality and storage medium |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112818861A (en) * | 2021-02-02 | 2021-05-18 | 南京邮电大学 | Emotion classification method and system based on multi-mode context semantic features |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108197115B (en) * | 2018-01-26 | 2022-04-22 | 上海智臻智能网络科技股份有限公司 | Intelligent interaction method and device, computer equipment and computer readable storage medium |
CN111243579B (en) * | 2020-01-19 | 2022-10-14 | 清华大学 | Time domain single-channel multi-speaker voice recognition method and system |
CN111291534A (en) * | 2020-02-03 | 2020-06-16 | 苏州科技大学 | Global coding method for automatic summarization of Chinese long text |
CN113642674A (en) * | 2021-09-03 | 2021-11-12 | 贵州电网有限责任公司 | Multi-round dialogue classification method based on graph convolution neural network |
CN113936637A (en) * | 2021-10-18 | 2022-01-14 | 上海交通大学 | Voice self-adaptive completion system based on multi-mode knowledge graph |
CN114519809A (en) * | 2022-02-14 | 2022-05-20 | 复旦大学 | Audio-visual video analysis device and method based on multi-scale semantic network |
CN114863917A (en) * | 2022-04-02 | 2022-08-05 | 深圳市大梦龙途文化传播有限公司 | Game voice detection method, device, equipment and computer readable storage medium |
CN115048944B (en) * | 2022-08-16 | 2022-12-20 | 之江实验室 | Open domain dialogue reply method and system based on theme enhancement |
CN116341558A (en) * | 2022-12-06 | 2023-06-27 | 上海海事大学 | Multi-modal emotion recognition method and model based on multi-level graph neural network |
CN116153339A (en) * | 2022-12-06 | 2023-05-23 | 上海海事大学 | Speech emotion recognition method and device based on improved attention mechanism |
CN116959442B (en) * | 2023-07-29 | 2024-03-19 | 浙江阳宁科技有限公司 | Chip for intelligent switch panel and method thereof |
CN116895287A (en) * | 2023-08-04 | 2023-10-17 | 齐鲁工业大学(山东省科学院) | SHAP value-based depression voice phenotype analysis method |
- 2024-05-22 CN CN202410634545.1A patent/CN118230722B/en active Active
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112818861A (en) * | 2021-02-02 | 2021-05-18 | 南京邮电大学 | Emotion classification method and system based on multi-mode context semantic features |
Also Published As
Publication number | Publication date |
---|---|
CN118230722A (en) | 2024-06-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110491416B (en) | Telephone voice emotion analysis and identification method based on LSTM and SAE | |
CN108597496B (en) | Voice generation method and device based on generation type countermeasure network | |
CN112562691B (en) | Voiceprint recognition method, voiceprint recognition device, computer equipment and storage medium | |
CN118230722B (en) | Intelligent voice recognition method and system based on AI | |
CN108922543B (en) | Model base establishing method, voice recognition method, device, equipment and medium | |
CN113488058A (en) | Voiceprint recognition method based on short voice | |
CN111724770B (en) | Audio keyword identification method for generating confrontation network based on deep convolution | |
US5734793A (en) | System for recognizing spoken sounds from continuous speech and method of using same | |
CN111899757A (en) | Single-channel voice separation method and system for target speaker extraction | |
CN114550703A (en) | Training method and device of voice recognition system, and voice recognition method and device | |
Pardede et al. | Convolutional neural network and feature transformation for distant speech recognition | |
Zhang et al. | Temporal Transformer Networks for Acoustic Scene Classification. | |
Qi et al. | Exploiting low-rank tensor-train deep neural networks based on Riemannian gradient descent with illustrations of speech processing | |
KS et al. | Comparative performance analysis for speech digit recognition based on MFCC and vector quantization | |
Biagetti et al. | Speaker identification in noisy conditions using short sequences of speech frames | |
Nasrun et al. | Human emotion detection with speech recognition using Mel-frequency cepstral coefficient and support vector machine | |
Qais et al. | Deepfake audio detection with neural networks using audio features | |
Ong et al. | Speech emotion recognition with light gradient boosting decision trees machine | |
CN113488069A (en) | Method and device for quickly extracting high-dimensional voice features based on generative countermeasure network | |
Nijhawan et al. | Real time speaker recognition system for hindi words | |
Al-Thahab | Speech recognition based radon-discrete cosine transforms by Delta Neural Network learning rule | |
CN114512133A (en) | Sound object recognition method, sound object recognition device, server and storage medium | |
CN112951270A (en) | Voice fluency detection method and device and electronic equipment | |
Iswarya et al. | Speech query recognition for Tamil language using wavelet and wavelet packets | |
Kaewprateep et al. | Evaluation of small-scale deep learning architectures in Thai speech recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |