
WO2020083110A1 - Speech recognition and speech recognition model training method and apparatus - Google Patents

Speech recognition and speech recognition model training method and apparatus

Info

Publication number
WO2020083110A1
WO2020083110A1 (PCT/CN2019/111905; CN2019111905W)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
target
target word
anchor
voice
Prior art date
Application number
PCT/CN2019/111905
Other languages
English (en)
French (fr)
Inventor
王珺
苏丹
俞栋
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to EP19874914.5A (EP3767619A4)
Publication of WO2020083110A1
Priority to US17/077,141 (US11798531B2)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272: Voice signal separating
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/16: Speech classification or search using artificial neural networks
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech

Definitions

  • the present application relates to the field of computer technology, and in particular, to a method and device for speech recognition and speech recognition model training.
  • AI: Artificial Intelligence
  • Artificial Intelligence is a theory, method, technology, and application system that uses digital computers or digital computer-controlled machines to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technology in computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machine has the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive subject, covering a wide range of fields, both hardware-level technology and software-level technology.
  • Basic technologies of artificial intelligence generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation / interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology and machine learning / deep learning.
  • Key speech technologies include Automatic Speech Recognition (ASR), speech synthesis (Text-To-Speech, TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the future direction of human-computer interaction, and voice is expected to become one of the most favored methods of human-computer interaction.
  • ASR: Automatic Speech Recognition
  • TTS: speech synthesis (Text-To-Speech)
  • In a related speech recognition method, a deep attractor network is mainly used to generate an attractor for each speaker's speech in the mixed speech; the mask weight attributing each time-frequency window to the corresponding speaker is then estimated by calculating the distance of each embedding vector from these attractors, so that the speech of each speaker is distinguished according to the mask weights.
  • However, the voice recognition method in the prior art needs to know or estimate in advance the number of speakers in the mixed speech in order to distinguish the voices of different speakers, and it cannot track and extract the voice of a specific target speaker.
  • Embodiments of the present application provide a voice recognition method, a voice recognition model training method and apparatus, an electronic device, and a storage medium, to solve the problems in the prior art that voice recognition accuracy is low and the voice of a target speaker cannot be tracked and recognized.
  • An embodiment of the present application provides a voice recognition method, which is executed by an electronic device and includes:
  • the target voice is recognized.
  • the speech recognition model includes a target speech extraction module and a target word judgment module.
  • the method includes:
  • the speech sample set is any one or a combination of the following: a clean target word voice sample set, a positive and negative sample set of the disturbed target word voice, and a disturbed command voice sample set;
  • Training a target speech extraction module, wherein the input of the target speech extraction module is the speech sample set, the output is the recognized target speech, and the objective function of the target speech extraction module is to minimize the loss function between the recognized target speech and the clean target speech;
  • Training a target word judgment module, wherein the input of the target word judgment module is the target speech output by the target speech extraction module, the output is the target word judgment probability, and the objective function of the target word judgment module is to minimize the cross-entropy loss function of the target word judgment result.
  • a voice recognition device including:
  • the first obtaining module is used to recognize the target word speech from the mixed speech, obtain the anchor extraction feature of the target word speech based on the target word speech, and use the anchor extraction feature of the target word speech as the anchor extraction feature of the target speech;
  • a second obtaining module, configured to obtain a mask of the target speech according to the anchor extraction feature of the target speech;
  • the recognition module is used for recognizing the target voice according to the mask of the target voice.
  • the speech recognition model includes a target speech extraction module and a target word judgment module.
  • the device includes:
  • An acquisition module for acquiring a speech sample set, wherein the speech sample set is any one or a combination of the following: a clean target word speech sample set, a positive and negative sample set of disturbed target word speech, and a disturbed command speech sample set;
  • a training module for training a target speech extraction module, wherein the input of the target speech extraction module is the speech sample set, the output is the recognized target speech, and the objective function of the target speech extraction module is to minimize the loss function between the recognized target speech and the clean target speech; and for training a target word judgment module, wherein the input of the target word judgment module is the target speech output by the target speech extraction module, the output is the target word judgment probability, and the objective function of the target word judgment module is to minimize the cross-entropy loss function of the target word judgment result.
  • Another embodiment of the present application provides an electronic device, including:
  • At least one memory for storing computer readable program instructions
  • At least one processor is configured to call computer-readable program instructions stored in the memory, and execute any one of the foregoing speech recognition methods or speech recognition model training methods according to the obtained computer-readable program instructions.
  • Another embodiment of the present application provides a computer-readable storage medium on which computer-readable program instructions are stored, the computer-readable program instructions being loaded by a processor to execute any one of the foregoing speech recognition methods or speech recognition model training methods.
  • In the embodiments of the present application, the anchor extraction feature of the target speech is determined according to the target word speech in the mixed speech; a mask of the target speech is then obtained according to this anchor extraction feature, and the target speech is recognized according to the mask. Further, a specific target speech can be recognized and tracked according to the target word without knowing or estimating in advance the number of speakers in the mixed speech; only the anchor extraction feature of the target speech needs to be calculated, which improves the accuracy and effectiveness of speech recognition.
  • FIG. 1 is a flowchart of a voice recognition method in an embodiment of this application
  • FIG. 3 is a frame diagram of a voice recognition system in an embodiment of this application.
  • FIG. 4 is a structural framework diagram of an implementation solution of a target speech extraction module in an embodiment of the present application.
  • FIG. 5 is a structural framework diagram of an implementation scheme of a target word judgment module in an embodiment of the application.
  • FIG. 6 is a structural framework diagram of a training scheme based on clean target word speech in an embodiment of the present application
  • FIG. 7 is a structural framework diagram of a training scheme based on a disturbed target word speech in an original embedding space in an embodiment of the present application
  • FIG. 8 is a structural framework diagram of a training solution based on disturbed target word speech in a regular embedded space in an embodiment of the present application
  • FIG. 9 is a structural framework diagram of a test solution of a voice recognition method in an embodiment of this application.
  • FIG. 10 is a schematic diagram of a test process of a voice recognition method in an embodiment of this application.
  • FIG. 11 is a schematic structural diagram of a voice recognition device in an embodiment of this application.
  • FIG. 12 is a schematic structural diagram of a speech recognition model training device in an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of an electronic device in an embodiment of this application.
  • FIG. 14 is a schematic structural diagram of a terminal in an embodiment of the present application.
  • Artificial intelligence technology has been researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned and autonomous driving, drones, robots, intelligent medical care, intelligent customer service, and voice recognition. It is believed that with the development of technology, artificial intelligence will be applied in more fields and deliver increasingly important value.
  • Wake-up word: a word that wakes up an artificial intelligence (AI) device so that the AI device enters the awakened state.
  • Embedding vector: in the embodiments of the present application, a fixed-length vector representation of a speech signal mapped into an embedding space of a certain dimension.
  • Regular embedding vector: in the embodiments of the present application, a vector representation obtained after two embedding space mappings.
  • Anchor extraction feature: a speech feature representation of a speech signal.
  • Mask: a signal mask can be understood as a "bitmap" in which each bit corresponds to a signal and can be used to mask the corresponding signal.
  • Meanings of the labels used with the spectra and features below: n: noisy (interfered); c: clean; nw: noisy (interfered) wake-up word speech; cw: clean wake-up word speech; nc: noisy (interfered) command speech; cc: clean command speech.
  • The input spectrum X_{f,t} is the short-time Fourier transform spectrum in the log domain, where f is the index in the spectrum (frequency) dimension and t is the frame index in the time dimension.
  • In the related technical solution, the number of speakers in the mixed speech needs to be known or estimated in advance in order to distinguish the speech of different speakers; however, that solution cannot directly track or recognize the speech of a specific target speaker, nor can it extract the target speaker's speech from the mixed speech in a targeted manner.
  • In addition, the attractor calculated for each speaker in that solution is trained and learned in a single-layer embedding space, and the resulting attractor distribution is relatively loose and unstable, which reduces the accuracy of speech recognition.
  • An offline K-means clustering method is also provided, which can make the attractor distribution relatively concentrated, but it requires clustering over multiple frames of the speech signal and therefore cannot support frame-by-frame real-time processing, which reduces the efficiency of speech recognition.
  • In a typical human-machine interaction scenario, each interaction consists of the target speaker's target word followed by a command speech input.
  • The embodiments of this application mainly perform multi-task training in combination with the target word, and determine the target speech features based on the target word.
  • The person who speaks the target word is regarded as the target speaker, and the speech features of the target word speech are taken as the target speech features, so that the target speech is determined and tracked by recognizing the target word and is then extracted from the subsequently received disturbed command speech, that is, from the mixed speech, without needing to predict the number of speakers in the mixed speech.
  • Moreover, a double-layer embedding space is used for calculation and extraction, so the resulting target speech feature, that is, the anchor extraction feature, is more concentrated and stable, which makes the recognition and extraction of the target speech more accurate.
  • The voice recognition method in the embodiments of the present application may be executed by a smart terminal; alternatively, after the smart terminal receives the mixed voice, the mixed voice may be sent to a server, the server performs voice recognition, and the voice recognition result is sent back to the smart terminal.
  • the intelligent terminal and the server can be connected through the Internet to achieve mutual communication.
  • the server may be a background server that provides corresponding network services.
  • Which device performs the voice recognition method is not limited in the embodiments of the present application.
  • For example, the target word speech is wake-up word speech.
  • the embodiments of the present application are mainly described and described by taking wake-up words as an example.
  • FIG. 1 is a flowchart of a voice recognition method in an embodiment of the present application, the method includes:
  • Step 100 Identify the target word speech from the mixed speech, and obtain the anchor extraction feature of the target word speech based on the target word speech, and use the anchor extraction feature of the target word speech as the anchor extraction feature of the target speech.
  • When step 100 is executed, it specifically includes steps a1 to a2:
  • Step a1 The target word speech is recognized from the mixed speech.
  • the embedding vector corresponding to each time-frequency window of the mixed speech is determined; according to the determined embedding vectors and the preset anchor extraction features, the target word labeling information corresponding to each embedding vector is determined.
  • Step a2 Obtain the anchor extraction feature of the target word speech based on the target word speech, and use the anchor extraction feature of the target word speech as the anchor extraction feature of the target speech.
  • the anchor extraction feature of the target speech is obtained according to each embedding vector, preset anchor extraction feature and corresponding target word tagging information.
  • Step 110 Obtain a mask of the target voice according to the anchor extraction feature of the target voice.
  • Step 120 Identify the target voice according to the mask of the target voice.
  • In the embodiments of the present application, the target word speech is recognized in the mixed speech, the speech features of the target word speech are learned, and these speech features are used as the speech features of the target speech, that is, the anchor extraction feature of the target speech is obtained; the mask of the target speech can then be calculated according to this anchor extraction feature, and the target speech can be recognized.
  • FIG. 2 is a flowchart of another voice recognition method in an embodiment of the present application, the method includes:
  • Step 200 Determine the embedding vector corresponding to each time-frequency window of the mixed speech.
  • When step 200 is executed, it specifically includes steps b1 to b2:
  • Step b1 Perform a short-time Fourier transform on the mixed speech to obtain the frequency spectrum of the mixed speech.
  • the main principle of the short-time Fourier transform is to add a sliding time window to the signal, and Fourier transform the signal in the window to obtain the time-varying frequency spectrum of the signal.
  • Step b2 Based on the pre-trained deep neural network, the frequency spectrum of the mixed speech is mapped into the original embedding space of a fixed dimension to obtain the embedding vector corresponding to each time-frequency window of the mixed speech.
  • the frequency spectrum of the mixed speech after the short-time Fourier transform is X f, t
  • the deep neural network is, for example, a Long Short-Term Memory (LSTM), which is not limited in the embodiments of the present application.
  • the deep neural network is composed of 4 bidirectional LSTM layers, each layer of LSTM has 600 nodes, and the specific parameter settings can be set and adjusted according to the actual situation.
  • The embodiments of the present application do not specifically limit the model type or topology of the deep neural network; various other effective model structures may also be used, for example, convolutional neural networks (CNN), models combining several network structures, or other network structures such as time-delay networks and gated convolutional neural networks.
  • the topology structure of the deep neural network can be expanded or simplified according to the actual application limitation on the memory consumption of the model and the requirement on detection accuracy.
  • The embedding vector represents a fixed-length vector representation of the speech signal mapped into a space of a certain dimension, the embedding vector being V_{f,t} ∈ R^K.
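  • As a non-normative sketch of steps b1 and b2, the Python (PyTorch) code below maps a mixed speech waveform to a K-dimensional embedding vector per time-frequency window using 4 bidirectional LSTM layers with 600 nodes each, as described above; the FFT size, hop length, embedding dimension K, and all names are illustrative assumptions, not values fixed by the application.

```python
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    """Maps a log-magnitude spectrum X[t, f] to K-dimensional embeddings V[t, f] (sketch)."""
    def __init__(self, n_freq=257, k=40, hidden=600, layers=4):
        super().__init__()
        # 4 bidirectional LSTM layers with 600 nodes each, as described in the text.
        self.lstm = nn.LSTM(n_freq, hidden, num_layers=layers,
                            bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, n_freq * k)
        self.n_freq, self.k = n_freq, k

    def forward(self, log_spec):                      # log_spec: (batch, T, F)
        h, _ = self.lstm(log_spec)                    # (batch, T, 2 * hidden)
        v = self.proj(h)                              # (batch, T, F * K)
        return v.view(log_spec.size(0), -1, self.n_freq, self.k)   # (batch, T, F, K)

def log_stft(wave, n_fft=512, hop=256):
    """Step b1: short-time Fourier transform, then log-magnitude spectrum."""
    spec = torch.stft(wave, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    return torch.log(spec.abs() + 1e-8).transpose(-1, -2)          # (batch, T, F)
```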
  • Step 210 According to the determined embedding vectors and the preset anchor extraction feature, determine the target word labeling information corresponding to each embedding vector, and, according to the embedding vectors, the preset anchor extraction feature, and the corresponding target word labeling information, obtain the anchor extraction feature of the target speech.
  • When step 210 is executed, it specifically includes steps c1 to c2:
  • Step c1 According to the determined embedding vectors and preset anchor extraction features, determine the target word labeling information corresponding to each embedding vector.
  • Specifically, each embedding vector is merged with the preset anchor extraction feature; the merged vectors are input into the pre-trained first forward network, and, based on the first forward network, the target word labeling information corresponding to each merged vector is recognized and output.
  • For example, if each embedding vector is V_{f,t} and the preset anchor extraction feature is Ā, then V_{f,t} and Ā are merged into a 2K-dimensional vector and input into the first forward network, and the corresponding target word labeling information, denoted Y_{f,t}, is predicted, so as to obtain labeling information indicating whether each embedding vector belongs to the target speech.
  • the target word speech can be recognized from the mixed speech.
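  • A minimal sketch of step c1 under the same assumptions (class, layer sizes, and names are hypothetical): each embedding vector is concatenated with the preset anchor extraction feature into a 2K-dimensional vector and passed through a small forward network whose sigmoid output is interpreted as the target word labeling information Y_{f,t}.

```python
import torch
import torch.nn as nn

class LabelNet(nn.Module):
    """First forward network: predicts target word labels Y[f, t] from [V[f, t]; preset anchor]."""
    def __init__(self, k=40, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * k, hidden), nn.ReLU(),     # 2K-dimensional merged input
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, v, preset_anchor):              # v: (..., K), preset_anchor: (K,)
        a = preset_anchor.expand(v.shape[:-1] + preset_anchor.shape)  # broadcast to every bin
        return self.net(torch.cat([v, a], dim=-1)).squeeze(-1)        # Y[f, t] in [0, 1]
```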
  • The preset anchor extraction feature is the centroid (average) of the anchor extraction features corresponding to the clean target word speech samples of each user in the clean target word speech sample set obtained through pre-training.
  • Because the anchor extraction features obtained through training in the embodiments of the present application are more concentrated and stable, the anchor extraction feature used in the speech recognition application is also more accurate, which makes the subsequent calculation of the anchor extraction feature of the target speech more accurate and improves the accuracy of target speech recognition and extraction.
  • Step c2 Obtain the anchor extraction feature of the target speech according to each embedding vector, preset anchor extraction feature and corresponding target word tagging information.
  • the anchor extraction feature of the target word speech is obtained, and the anchor extraction feature of the target word speech is used as the anchor extraction feature of the target speech.
  • For ease of description, this embodiment directly describes the obtained anchor extraction feature as the anchor extraction feature of the target speech.
  • What is actually calculated is the anchor extraction feature of the target word speech. Since the speech characteristics of the target speech and the target word speech are consistent, the embodiments of the present application can learn and track the target speech through the target word speech; therefore, the anchor extraction feature of the target word speech may be used as the anchor extraction feature of the target speech.
  • For example, if each embedding vector is V_{f,t}, the preset anchor extraction feature is Ā, and the target word labeling information is Y_{f,t}, then the anchor extraction feature of the target speech, A_nw, is calculated as: A_nw = α · (Σ_{f,t} V_{f,t} · Y_{f,t} / Σ_{f,t} Y_{f,t}) + (1 - α) · Ā, where α is an adjustment parameter.
  • A larger α means the calculated anchor extraction feature is biased more toward the anchor extraction feature estimated from the target speech, and a smaller α biases it more toward the preset anchor extraction feature.
  • In addition, the value of α can be adjusted to update the anchor extraction feature of the target speech, thereby improving its accuracy.
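  • A numerical sketch of the computation just described, with the function name, array shapes, and default α as assumptions: the anchor estimated from the time-frequency windows labeled as target word speech is interpolated with the preset anchor through α.

```python
import numpy as np

def target_anchor(V, Y, preset_anchor, alpha=0.5):
    """A_nw: anchor extraction feature of the target speech (sketch).

    V: (T, F, K) embedding vectors, Y: (T, F) target word labels in [0, 1],
    preset_anchor: (K,) average anchor of the clean target word speech sample set.
    """
    # Anchor estimated from the bins labeled as target word speech.
    estimated = (V * Y[..., None]).sum(axis=(0, 1)) / (Y.sum() + 1e-8)
    # Larger alpha biases the result toward the estimate, smaller toward the preset anchor.
    return alpha * estimated + (1.0 - alpha) * preset_anchor
```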
  • Step 220 According to each embedding vector and the anchor extraction feature of the target speech, obtain a regular embedding vector corresponding to each embedding vector, and, according to each regular embedding vector and the preset regular anchor extraction feature, obtain a mask of the target speech.
  • When step 220 is executed, it specifically includes steps d1 to d2:
  • Step d1 According to each embedding vector and the anchor extraction feature of the target speech, obtain a regular embedding vector corresponding to each embedding vector.
  • Specifically, each embedding vector is merged with the anchor extraction feature of the target speech, each merged 2K-dimensional vector is mapped again into an embedding space of fixed dimension, the corresponding K-dimensional vector output by the second forward network is obtained, and this output K-dimensional vector is used as the regular embedding vector of the corresponding embedding vector; the second forward network is used to map the original embedding space to the regular embedding space.
  • For example, if each embedding vector is V_{f,t} and the anchor extraction feature of the target speech is A_nw, then the regular embedding vector obtained is Ṽ_{f,t} = f(V_{f,t}, A_nw), where:
  • f ( ⁇ ) represents the nonlinear mapping function learned by the deep neural network, and its function is to map the original embedding space to the new regular embedding space.
  • the parameters of the second forward network can also be set according to the actual situation, for example, a 2-layer forward network, the number of nodes in each layer is 256, the input is a 2K-dimensional vector, and the output is a K-dimensional vector.
  • the topology of the forward network can also be expanded or simplified according to the actual application restrictions on the memory consumption of the model and the requirements on the accuracy of detection, which is not limited in the embodiments of the present application.
  • The regular embedding vector represents the vector after two spatial mappings: the first mapping is based on the mixed speech spectrum, and the second mapping is based on the embedding vector obtained from the first mapping and the calculated anchor extraction feature of the target speech.
  • In other words, the mixed speech undergoes two embedding space mappings, that is, a double-layer embedding space, and is finally mapped into the regular embedding space, so that the mask of the target speech can be calculated according to the regular anchor extraction feature of the target speech.
  • Through regularization, the influence of interference can be reduced, and the distribution of the regular anchor extraction features of the target speech is more concentrated and stable, thereby improving the accuracy of the recognized target speech.
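  • A sketch of step d1 under the assumptions above: the second forward network takes the 2K-dimensional concatenation of each embedding vector with the anchor extraction feature A_nw of the target speech and outputs a K-dimensional regular embedding vector; the 2-layer, 256-node shape follows the parameters mentioned in the text, while the class and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class RegularEmbeddingNet(nn.Module):
    """Second forward network: maps [V[f, t]; A_nw] into the regular embedding space (sketch)."""
    def __init__(self, k=40, hidden=256):
        super().__init__()
        # 2-layer forward network, 256 nodes per layer, 2K-dim input, K-dim output.
        self.net = nn.Sequential(
            nn.Linear(2 * k, hidden), nn.ReLU(),
            nn.Linear(hidden, k))

    def forward(self, v, a_nw):                       # v: (..., K), a_nw: (K,)
        a = a_nw.expand(v.shape[:-1] + a_nw.shape)    # broadcast the anchor to every bin
        return self.net(torch.cat([v, a], dim=-1))    # regular embedding vectors, (..., K)
```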
  • Step d2 According to each regular embedding vector and the preset regular anchor extraction feature, obtain a mask of the target speech.
  • The preset regular anchor extraction feature represents the average of the centroids of the regular anchor extraction features corresponding to the interfered speech samples of each user in the disturbed speech sample set obtained through pre-training, that is, the average of the regular anchor extraction features of the target speech obtained through pre-training on the positive and negative sample set of disturbed target word speech or on the disturbed command speech sample set, which will be described in detail below.
  • the distance between each regular embedding vector and the extracted feature of the preset regular anchor is calculated respectively, and the mask of the target voice is obtained according to the value of each distance.
  • the values of each distance are mapped into the range of [0,1], and the mask of the target voice is formed according to the values of each distance after the mapping.
  • For example, if the preset regular anchor extraction feature is Ã and the regular embedding vectors are Ṽ_{f,t}, then the mask of the target speech is calculated as: M_{f,t} = Sigmoid(Ã · Ṽ_{f,t}).
  • Sigmoid is an S-type function used to map a variable into [0,1], that is, in the embodiments of the present application, to map the value of each distance into the range [0,1], in order to facilitate the subsequent extraction of the target speech.
  • Step 230 Identify the target voice according to the mask of the target voice.
  • For example, if the frequency spectrum of the mixed speech is X_{f,t} and the mask of the target speech is M_{f,t}, then the recognized target speech is the masked spectrum X_{f,t} × M_{f,t}.
  • Since the mask of the target speech is calculated based on the inner product of each regular embedding vector and the anchor extraction feature of the target speech, a larger inner product value means a smaller distance between the regular embedding vector and the anchor extraction feature, and thus a greater probability that the corresponding time-frequency window belongs to the target speech; the larger the value of the mask calculated for that time-frequency window, the more of that time-frequency window is extracted, so that the finally calculated target speech is closer to the actual target speaker's speech.
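  • Putting steps d2 and 230 together, a sketch with assumed shapes and function names: the mask is the sigmoid of the inner product between each regular embedding vector and the (preset) regular anchor extraction feature, and the recognized target speech spectrum is the element-wise product of the mixed spectrum with that mask.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def target_mask(regular_V, regular_anchor):
    """M[f, t] = Sigmoid(regular_anchor . regular_V[f, t]), values in [0, 1]."""
    return sigmoid(np.tensordot(regular_V, regular_anchor, axes=([-1], [0])))

def recover_target(mixed_spec, mask):
    """Masked spectrum of the recognized target speech: X[f, t] * M[f, t]."""
    return mixed_spec * mask
```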
  • In the embodiments of the present application, the target speech may be recognized from the currently input mixed speech, or it may be recognized from the mixed command speech received subsequently after the device wakes up; the voice recognition method is applicable to both cases.
  • Moreover, the anchor extraction feature of the target speech can also be dynamically adjusted. For example, if the target word is a wake-up word, after the wake-up word speech is recognized and the device wakes up, the target speech in the mixed speech received in the awake state is recognized, thereby improving the accuracy of target speech recognition throughout the device's awake state.
  • Specifically, the embodiments of the present application provide a possible implementation: the recognized target speech is input into a pre-trained target word judgment module to determine whether the target speech includes the target word speech, and the anchor extraction feature of the target speech is then adjusted according to the judgment result, so that the target speech can be recognized based on the adjusted anchor extraction feature.
  • Adjusting the anchor extraction feature of the target speech according to the judgment result is specifically: if the judgment result is that the target speech includes the target word speech, the preset adjustment parameter is adjusted so that the weight of the preset anchor extraction feature in the calculated anchor extraction feature of the target speech is reduced; if the judgment result is that the target speech does not include the target word speech, the preset adjustment parameter is adjusted so that the weight of the preset anchor extraction feature in the calculated anchor extraction feature of the target speech is increased.
  • That is, what is adjusted is the value of α in the above formula for the anchor extraction feature of the target speech.
  • If it is determined that the target speech includes the target word speech, the value of α can be increased so that the weight of the preset anchor extraction feature Ā decreases and the weight of the anchor extraction feature estimated from the target speech increases; if it is determined that the target speech does not include the target word speech, it means the estimated target speech is not accurate, and the value of α can be decreased so that the weight of the preset anchor extraction feature increases and the weight of the estimated anchor extraction feature of the target speech decreases.
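  • A sketch of the adjustment rule just described; the step size and clipping are assumptions, as the application does not fix how much α changes per judgment.

```python
def adjust_alpha(alpha, contains_target_word, step=0.1):
    """Shift weight between the estimated anchor and the preset anchor extraction feature."""
    if contains_target_word:
        return min(1.0, alpha + step)   # trust the anchor estimated from the target speech more
    return max(0.0, alpha - step)       # fall back toward the preset anchor extraction feature
```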
  • This is because the target word labeling information used for estimation may itself contain errors, which may reduce the accuracy of the anchor extraction feature of the target speech. In the embodiments of the present application, if the target word speech recognition is correct, then when the smart terminal is not in the awake state the recognized target speech will certainly include the target word speech, and sometimes may also include command speech, for example, in a scene where the user says the target word and the command at the same time. Therefore, performing target word judgment on the recognized target speech, that is, judging whether the recognized target speech includes the target word speech, can also improve the accuracy of target word recognition.
  • For example, if it is determined that the target speech includes the target word speech, the weight of the preset anchor extraction feature is reduced and the weight of the estimated anchor extraction feature of the target speech is increased.
  • In this case, since it has been determined that the target speech includes the target word speech, after the smart terminal enters the awake state it can recognize the target speech in the subsequently received mixed command speech based on the adjusted anchor extraction feature of the target speech, so that the extracted target speech is more accurate.
  • In this way, the anchor extraction feature of the target speech can be dynamically adjusted, so that when recognizing the target speech in the mixed speech received after the device wakes up, recognition can be performed based on the adjusted anchor extraction feature, thereby improving the accuracy of target speech recognition.
  • That is, the smart terminal recognizes the target speech based on the adjusted anchor extraction feature of the target speech at wake-up and while in the awake state. After the smart terminal enters the sleep state again, the adjusted anchor extraction feature of the target speech is restored to the initial preset anchor extraction feature; the anchor extraction feature of the target speech is then recalculated and may be adjusted again.
  • the voice recognition method in the embodiment of the present application can be applied to multiple projects and product applications such as smart speakers, smart TV boxes, online voice interactive systems, smart voice assistants, in-vehicle smart voice devices, and simultaneous interpretation.
  • The speech recognition method in the embodiments of the present application can be applied to various far-field human-machine speech interaction scenarios. The target word speech and the anchor extraction feature of the target speech can be jointly optimized and trained, so that during application the anchor extraction feature of the target speech can be determined according to the target word speech and the target speech can be recognized, without needing to know or estimate in advance the number of speakers in the mixed speech.
  • Moreover, the speech recognition method in the embodiments of the present application can be applied even when the target word speech or other keyword speech is very short, and can still effectively track the target speech and learn its speech characteristics, giving it a wider range of application.
  • Furthermore, through regularization, the influence of interference can be eliminated, and the regularized anchor extraction features are more stable and concentrated. Therefore, in actual application, based on the learned preset anchor extraction feature and preset regular anchor extraction feature, the mixed speech can be processed frame by frame in real time to reconstruct the target speaker's speech.
  • In this way, high-quality target speaker speech can be reconstructed; the signal-to-distortion ratio (SDR) and Perceptual Evaluation of Speech Quality (PESQ) indicators of the reconstructed target speech are significantly improved, as are the accuracy of wake-up and of the automatic speech recognition system.
  • SDR: signal-to-distortion ratio
  • PESQ: Perceptual Evaluation of Speech Quality
  • the training process is executed in the background server. Since the training of each module may be complicated and the amount of calculation is large, the training process can be implemented by the background server, so that the trained model and results can be applied to each intelligent terminal to realize voice recognition.
  • speech recognition training mainly includes two major tasks.
  • The first task is reconstructing the clean speech of the target speaker, that is, the target speech extraction module, which is trained to obtain the anchor extraction feature of the target speech based on the target word and to recognize the target speech from the mixed speech;
  • The second task is target word judgment, that is, the target word judgment module, which is used to judge whether the reconstructed target speech includes the target word speech, thereby improving the accuracy of the target word labeling information.
  • a method for training a speech recognition module is provided, specifically:
  • Step f1 Acquire a speech sample set.
  • the voice sample set is any one or a combination of the following: a clean target word voice sample set, a positive and negative sample set of the disturbed target word voice, and a disturbed command voice sample set.
  • Step f2 training the target speech extraction module.
  • the input of the target speech extraction module is a voice sample set
  • the output is the recognized target speech.
  • the target function of the target speech extraction module is to minimize the loss function between the recognized target speech and the clean target speech.
  • Step f3 Train the target word judgment module.
  • the input of the target word judgment module is the target speech output by the target speech extraction module, and the output is the target word judgment probability, and the target function of the target word judgment module is to minimize the cross-entropy loss function of the target word judgment result.
  • In the embodiments of the present application, the training of the target speech extraction module and of the target word judgment module jointly optimizes the accuracy of recognizing the target word speech and the accuracy of the anchor extraction feature of the target speech, so that the target speech corresponding to the speech features of the target word can be recognized more accurately.
  • the embodiment of the present application does not limit the execution order of steps f2 and f3.
  • the speech recognition training model in the embodiment of the present application mainly includes two parts of a target speech extraction module and a target word judgment module, which will be separately introduced below.
  • the first part target speech extraction module.
  • FIG. 4 it is a structural framework diagram of an implementation solution of a target speech extraction module in an embodiment of the present application.
  • the speech recognition training process in the embodiment of the present application is similar to the actual speech recognition application process.
  • the training process of the target speech extraction module may use different speech signal sample sets for alternate training.
  • Figure 4 includes several different signal sample sets, which are the clean target word voice sample set, the positive and negative sample set of the disturbed target word voice, and the disturbed command voice sample set.
  • the embodiment of the present application provides an overall implementation solution of a target speech extraction module, specifically 1) to 5):
  • the clean target word speech sample set includes at least the clean target word speech samples and the corresponding target word labeling information;
  • the positive and negative sample set of disturbed target word speech includes at least the positive and negative samples of disturbed target word speech and the corresponding target word labeling information;
  • the disturbed command speech sample set includes at least the disturbed command speech samples and the corresponding target word labeling information.
  • The target word labeling information of a clean target word speech sample is determined as follows:
  • The input spectrum of the clean target word speech sample is compared with a threshold Γ: if the spectrum amplitude of a time-frequency window is lower than the highest amplitude of the input spectrum by more than the threshold Γ, the target word labeling information Y_{f,t}^{cw} of that time-frequency window is 0; otherwise it is 1.
  • Generally, the threshold Γ is 40 dB.
  • Of course, other values may also be set according to actual conditions and requirements.
  • The target word labeling information of a disturbed target word speech sample is calculated by comparing the spectrum amplitude of the target speaker's clean target word speech with the spectrum amplitude of the disturbed target word speech sample.
  • The embodiments of the present application provide a possible implementation: if it is determined that the spectrum amplitude of the target speaker's clean target word speech accounts for more than a preset ratio threshold of the spectrum amplitude of the disturbed target word speech sample, the target word labeling information Y_{f,t}^{nw} of that time-frequency window of the disturbed target word speech sample is 1; otherwise it is 0.
  • For example, if the preset ratio threshold is 1/2, then when the spectrum amplitude of the clean target word speech is greater than 1/2 of the spectrum amplitude of the disturbed target word speech sample, the label Y_{f,t}^{nw} equals "1", which means the corresponding time-frequency signal belongs to the target speaker; otherwise Y_{f,t}^{nw} equals "0", which means the corresponding time-frequency signal belongs to the interference signal.
  • Similarly, the target word labeling information of the disturbed command speech samples in the training stage can be calculated in the same way.
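  • A sketch of the two labeling rules above, assuming the spectrum is given in dB (log domain) for the threshold comparison and in linear magnitude for the ratio comparison; function names and array layouts are hypothetical.

```python
import numpy as np

def label_clean(clean_spec_db, gamma_db=40.0):
    """Y = 1 for bins within gamma_db of the clean sample's highest amplitude, else 0."""
    return (clean_spec_db >= clean_spec_db.max() - gamma_db).astype(np.float32)

def label_disturbed(clean_mag, disturbed_mag, ratio=0.5):
    """Y = 1 where the clean target word speech dominates the disturbed spectrum amplitude."""
    return (clean_mag > ratio * disturbed_mag).astype(np.float32)
```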
  • the dotted boxes in each figure indicate that each LSTM network shares the same set of parameter models, and the same parameters can be set.
  • Based on the embedding vectors V_{f,t}^{cw} of the clean target word speech samples and the corresponding target word labeling information Y_{f,t}^{cw}, the anchor extraction feature of each clean target word speech sample is calculated, specifically: A_cw = Σ_{f,t} V_{f,t}^{cw} · Y_{f,t}^{cw} / Σ_{f,t} Y_{f,t}^{cw}.
  • The anchor extraction features A_cw of the clean target word speech samples of all speakers in the clean target word speech sample set are averaged to obtain the average anchor extraction feature Ā of the clean target word speech sample set.
  • The deep neural network is, for example, an LSTM network.
  • α is an adjustment parameter that can be adjusted dynamically through training, so that the anchor extraction feature of the target speech can be dynamically adjusted to improve its accuracy.
  • The anchor extraction feature A_cw of the clean target word speech samples calculated in 2) above, or the anchor extraction feature A_nw of the target speech of the disturbed target word speech samples calculated in 3) above, is used in the subsequent training respectively.
  • That is, the clean target word speech signal stream 1 and the disturbed target word speech signal stream 2 in FIG. 4 are trained alternately to obtain the anchor extraction features of the target speech in the different training processes; this is completed in the original embedding space, that is, the calculation of the anchor extraction features of the target speech in the first-layer embedding space.
  • Then, the output anchor extraction features of the target speech are respectively used in the regular embedding space, that is, for the calculation of the regular anchor extraction feature of the target speech in the second-layer embedding space and for the calculation and extraction of the target speech mask, which specifically includes steps (1) to (3):
  • Step (1) According to the embedding vectors of the disturbed command speech samples and the anchor extraction feature of the target speech, calculate the corresponding regular embedding vectors.
  • Specifically, each embedding vector is merged with the anchor extraction feature of the target speech, and each merged 2K-dimensional vector is input into forward network 2; based on forward network 2, each merged 2K-dimensional vector is mapped again into the fixed-dimension embedding space to obtain the corresponding K-dimensional vector output by forward network 2, and this output K-dimensional vector is used as the regular embedding vector of the corresponding embedding vector.
  • the forward network 2 is a two-layer forward network, the number of nodes in each layer is 256, the input is a 2K-dimensional vector, and the output is a K-dimensional regular embedded vector
  • f ( ⁇ ) represents the nonlinear mapping function learned by the deep neural network, which is used to map the original embedding space to the new regular embedding space.
  • Step (2) According to the regular embedding vectors and the target speaker labeling information in the disturbed command speech samples, that is, the target word labeling information, re-estimate the regular anchor extraction feature of the target speech, specifically: Ã = Σ_{f,t} Ṽ_{f,t} · Y_{f,t} / Σ_{f,t} Y_{f,t}.
  • Step (3) According to the regular anchor extraction feature of the target speech Ã and the regular embedding vectors Ṽ_{f,t}, the mask of the target speech is calculated, specifically: M_{f,t} = Sigmoid(Ã · Ṽ_{f,t}).
  • Sigmoid is an S-shaped function used to map the calculated inner product into [0,1].
  • In this way, the target speech is recognized from the disturbed target word speech samples or the disturbed command speech samples, that is, the masked spectrum of the obtained target speech is X_{f,t} × M_{f,t}.
  • the regular anchor extraction feature of the target speech is re-estimated in the regular embedded space, and the mask of the target speech is calculated, so that the distribution of the estimated anchor extraction features is more stable and concentrated.
  • In addition, in the embodiments of the present application, the mask of the target speech can also be calculated in the original embedding space, which can recognize a specific target speech to a certain extent. Specifically, the mask of the target speech is calculated from the anchor extraction feature A_nw of the target speech calculated in 3) above and the embedding vectors V_{f,t}, namely: M_{f,t} = Sigmoid(A_nw · V_{f,t}), where M_{f,t} is the mask of the target speech.
  • The obtained target speech is then X_{f,t} × M_{f,t}.
  • the second part the target word judgment module.
  • FIG. 5 it is a structural framework diagram of an implementation scheme of a target word judgment module in an embodiment of the present application.
  • the target word judgment module in the embodiment of the present application is used to make a probabilistic judgment on whether the target speech obtained by reconstruction includes the target word.
  • the input of this module is a masked spectrum feature output by the target speech extraction module
  • the output is the judgment probability of whether it is the target word.
  • T is related to the length of the target word; for example, T is 1.5 s and T' is 100 ms.
  • A shorter T can be set during training to achieve frame-by-frame judgment of the target speech spectrum.
  • In the embodiments of the present application, even target word speech of short length can be used to effectively track the target speech and learn its characteristics, so that the target speech in the disturbed speech can be recognized; the embodiments of the present application are therefore also applicable to practical scenarios in which the target word is short.
  • Specifically, the input features of each observation window can be passed through a convolutional neural network (CNN), a recurrent neural network (RNN), a fully connected network, and a softmax layer, finally outputting the predicted probability of whether the input is the target word.
  • CNN: Convolutional Neural Network
  • RNN: Recurrent Neural Network
  • Specific network parameters can be adjusted according to the calculation and memory resource limits in actual application scenarios. The following examples are possible in the embodiments of the present application, including 1) to 4):
  • A CNN: the number of filter channels is 32 to 256, the size of the convolution kernel is 5 to 40 in the time dimension and 1 to 20 in the spectrum dimension, and the convolution stride ranges from 4 to 20 in the time dimension and from 1 to 10 in the spectrum dimension.
  • An RNN: the hidden unit of the RNN may be an LSTM unit or a Gated Recurrent Unit (GRU), and the number of hidden units is 8 to 128.
  • A fully connected network: the number of nodes can be 32 to 128.
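  • A sketch of one possible configuration of the target word judgment module built from the ranges above: 32 CNN filter channels with a 5×10 kernel and a 4×4 stride, a GRU with 64 hidden units, a 64-node fully connected layer, and a softmax output; all concrete values are picked from the stated ranges for illustration only, and the class name is hypothetical.

```python
import torch
import torch.nn as nn

class TargetWordJudge(nn.Module):
    """CNN -> RNN -> fully connected -> softmax over {target word, not target word} (sketch)."""
    def __init__(self, n_freq=257, cnn_channels=32, rnn_hidden=64, fc_nodes=64):
        super().__init__()
        self.cnn = nn.Conv2d(1, cnn_channels, kernel_size=(5, 10), stride=(4, 4))
        self.rnn = nn.GRU(input_size=cnn_channels * ((n_freq - 10) // 4 + 1),
                          hidden_size=rnn_hidden, batch_first=True)
        self.fc = nn.Linear(rnn_hidden, fc_nodes)
        self.out = nn.Linear(fc_nodes, 2)

    def forward(self, masked_spec):                   # masked_spec: (batch, T, F) window
        x = masked_spec.unsqueeze(1)                  # (batch, 1, T, F)
        x = torch.relu(self.cnn(x))                   # (batch, C, T', F')
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        h, _ = self.rnn(x)
        logits = self.out(torch.relu(self.fc(h[:, -1])))   # use the last frame of the window
        return torch.softmax(logits, dim=-1)          # probability of target word / not
```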
  • the target word judgment module in the embodiment of the present application does not need to use all of the above-mentioned networks, but can also use only one of the networks for training. Compared with the related art, the structure and performance of the target word judgment module given in the embodiments of the present application are better, so that the accuracy of prediction can be improved.
  • In the embodiments of the present application, the target speech extraction module and the target word judgment module can simultaneously optimize target word speech recognition and target speech feature learning, and can effectively learn the anchor extraction feature of the target speech corresponding to the target word.
  • Therefore, in actual application, the learned anchor extraction feature of the target speech can be used as the preset anchor extraction feature without re-estimating it, so that the received speech signal can be processed frame by frame in real time and high-quality target speech can be reconstructed.
  • In the embodiments of the present application, training may be performed alternately with different training sample sets, so the training process may be divided into several different training stages.
  • The first training stage is training based on clean target word speech;
  • the second training stage is training based on disturbed target word speech in the original embedding space;
  • the third training stage is training based on disturbed target word speech in the regular embedding space.
  • The first training stage: referring to FIG. 6, which is a frame diagram of a training scheme based on clean target word speech in an embodiment of the present application.
  • the calculation method of each specific parameter is the same as the embodiment corresponding to FIG. 4 described above.
  • In the first training stage, the input is a clean target word speech sample, a positive or negative sample of disturbed target word speech, or a disturbed command speech sample;
  • the training goal is to optimize both the target speech reconstruction task and the target word judgment task, so the training objective function includes: minimizing the loss function L1 between the recognized target speech and the clean target speech, and minimizing the cross-entropy loss function L2 of the target word judgment result, to reduce the error rate of target word judgment.
  • The loss function L1 is the error between the frequency spectrum of the reconstructed target speech and that of the clean target speech: L1 = Σ_{f,t} (X_{f,t} × M_{f,t} - S_{f,t})², where S_{f,t} is the spectrum of the clean target speech.
  • The second objective is the cross-entropy (CE) loss function L2 of the target word judgment result, where the "is / is not target word" labels required for calculating the cross-entropy loss can be obtained by frame-level alignment of the clean target wake-up word speech using an Automatic Speech Recognition (ASR) system based on a Gaussian Mixture Model (GMM) / Hidden Markov Model (HMM).
  • ASR: Automatic Speech Recognition
  • GMM: Gaussian Mixture Model
  • HMM: Hidden Markov Model
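  • A sketch of the two training objectives just described, with assumed tensor shapes and an assumed task weight: L1 is the squared error between the masked (reconstructed) spectrum and the clean target spectrum, and L2 is the cross entropy of the target word judgment probabilities against the frame-aligned "is / is not target word" labels.

```python
import torch
import torch.nn.functional as F

def loss_reconstruction(masked_spec, clean_spec):
    """L1: error between the reconstructed target speech spectrum and the clean target spectrum."""
    return torch.mean((masked_spec - clean_spec) ** 2)

def loss_target_word(judge_probs, labels):
    """L2: cross entropy of the target word judgment result (labels from GMM/HMM alignment)."""
    return F.nll_loss(torch.log(judge_probs + 1e-8), labels)

def total_loss(masked_spec, clean_spec, judge_probs, labels, lam=1.0):
    """Multi-task objective: minimize L1 and L2 jointly (lam is an assumed weight)."""
    return loss_reconstruction(masked_spec, clean_spec) + lam * loss_target_word(judge_probs, labels)
```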
  • In addition, the anchor extraction features A_cw of the clean target word speech samples of all speakers in the clean target word speech sample set can also be averaged to obtain the average anchor extraction feature Ā of the clean target word speech sample set.
  • The target speech is then recognized from the disturbed target word speech samples or the disturbed command speech samples, that is, the masked spectrum of the target speech is obtained.
  • the objective function is to minimize the loss function between the recognized target speech and the clean target speech.
  • the recognized target speech is input to the target word judgment module to determine whether there is a target word, and the target function is the cross-entropy loss function that minimizes the target word judgment result.
  • The second training stage: refer to FIG. 7, which is a frame diagram of a training scheme based on disturbed target word speech in the original embedding space in an embodiment of the present application; the specific calculation method of each parameter is the same as in the embodiment corresponding to FIG. 4 above.
  • In the second training stage, the input is the positive and negative samples of disturbed target word speech and / or the disturbed command speech samples;
  • the training target is basically the same as in the first stage above, and includes: minimizing the loss function L1 between the recognized target speech and the clean target speech, and minimizing the cross-entropy loss function L2 of the target word judgment result.
  • The second stage is mainly used to optimize the relevant network parameters in the original embedding space, so the reconstructed target speech is obtained in the original embedding space;
  • correspondingly, the input of the target word judgment module for calculating L2 in the second stage is the target speech reconstructed in the original embedding space.
  • The average anchor extraction feature used in the second stage is calculated by averaging the anchor extraction features of the clean target word speech samples obtained for all speakers in the first training stage.
  • the target word labeling information corresponding to each embedding vector of the disturbed target word speech sample is determined.
  • the anchor extraction feature of the target speech is obtained.
  • the target speech is recognized from the disturbed target word speech samples or the disturbed command speech samples.
  • the recognized target speech is input to the target word judgment module to determine whether there is a target word, and the target function is the cross-entropy loss function that minimizes the target word judgment result.
  • The third training stage: referring to FIG. 8, which is a frame diagram of a training scheme based on disturbed target word speech in the regular embedding space in an embodiment of the present application; the specific calculation method of each parameter is the same as in the embodiment corresponding to FIG. 4 above.
  • The input of this third training stage is the positive and negative samples of disturbed target word speech and / or the disturbed command speech samples;
  • the training target is the same as in the first stage above, and includes: minimizing the loss function L1 between the recognized target speech and the clean target speech, and minimizing the cross-entropy loss function L2 of the target word judgment result.
  • the third training stage is mainly used to optimize the network parameters related to regular embedded space.
  • The average anchor extraction feature used in the third training stage is calculated by averaging the anchor extraction features of the clean target word speech samples obtained for all speakers in the first training stage.
  • the target word labeling information corresponding to each embedding vector of the disturbed target word speech sample is determined.
  • the anchor extraction feature of the target speech is obtained.
  • the regular anchor extraction feature of the target speech is obtained, and according to each regular embedding vector and the regular anchor extraction feature of the target speech, the target speech mask is obtained.
  • the target speech is recognized from the disturbed target word speech samples or the disturbed command speech samples.
  • the recognized target speech is input to the target word judgment module to determine whether there is a target word, and the target function is the cross-entropy loss function that minimizes the target word judgment result.
  • the above three stages of training may be performed sequentially, alternately, or iteratively.
  • the Adaptive Moment Estimation (ADAM) optimization algorithm may be used in the implementation of each training stage.
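How the two objectives and the ADAM optimizer could be combined into an alternating schedule is sketched below; the loss weighting `lambda_ce`, the number of steps, and the batch interface are assumptions, since the document only states that the stages may run sequentially, alternately, or iteratively with ADAM:

```python
import torch

def run_training(stage_batches, optimizer, steps=10000, lambda_ce=1.0):
    """Alternate over the three training stages; each element of `stage_batches`
    is an assumed callable drawing a batch for its stage and returning
    (reconstructed_spectrum, clean_spectrum, keyword_logits, keyword_labels)."""
    mse = torch.nn.MSELoss()                 # L1: reconstruction loss
    ce = torch.nn.CrossEntropyLoss()         # L2: target word judgment loss
    for step in range(steps):
        draw = stage_batches[step % len(stage_batches)]
        reconstructed, clean, kw_logits, kw_labels = draw()
        loss = mse(reconstructed, clean) + lambda_ce * ce(kw_logits, kw_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

The optimizer would typically be constructed as `torch.optim.Adam(model.parameters(), lr=1e-3)`; the learning rate is likewise an assumption.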
  • Referring to FIG. 9, it is a framework diagram of the test scheme of the speech recognition method in the embodiment of the present application.
  • the test process is similar to the actual application process, that is, similar to the embodiment corresponding to FIG. 2 described above.
  • in the test process, for the disturbed speech, that is, the input mixed speech, the label of the target speech is unknowable; therefore, in the embodiment of the present application, the centroid of the anchor extraction features corresponding to the clean target word speech samples of all speakers in the training set is used as the preset anchor extraction feature during the test, that is, the average anchor extraction feature of the clean target word speech sample set trained in the first training stage serves as the preset anchor extraction feature in the test process;
  • the centroid of the regular anchor extraction features of the disturbed target word speech samples of all speakers in the training set is used as the preset regular anchor extraction feature during the test, that is, the average value of the regular anchor extraction features of the target speech of the disturbed target word speech sample set or the disturbed command speech sample set trained in the third training stage is used as the preset regular anchor extraction feature in the test process.
  • during the test, the mixed speech is first mapped to the embedding vectors; according to the preset anchor extraction feature and the embedding vectors, the target word labeling information is predicted through the forward network 1, and the anchor extraction feature Anw of the target speech is calculated; then, through the forward network 2, the regular embedding vector corresponding to each embedding vector is calculated.
  • according to the regular embedding vectors and the preset regular anchor extraction feature, the mask of the target speech is calculated, and the target speech is identified from the input mixed speech, that is, the masked spectrum is obtained; in other words, the target speech of the target speaker is reconstructed.
  • the reconstructed target speech is input to the target word judgment module for target word judgment prediction; if the target word is included, the device enters the state corresponding to the target word, such as the wake-up state; if the target word is not included, the device remains in the un-awakened state; the calculated anchor extraction feature Anw of the target speech is dynamically adjusted based on the judgment result, to improve the accuracy with which the device recognizes and tracks the target speech in the input mixed speech in the awake state.
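Putting the pieces together, the test flow of FIG. 9 can be sketched as below, reusing `second_stage_extract` from the earlier sketch; the embedding network is treated as a given callable, the two preset anchors are the training-set centroids described above, and the function and variable names are assumptions:

```python
import numpy as np

def test_pipeline(X, lstm_embed, forward_net_1, forward_net_2,
                  anchor_avg_cw, reg_anchor_preset, alpha=0.5):
    """Test-time extraction of the target speech from the input mixed speech."""
    V = lstm_embed(X)                                    # (T, F, K) embeddings
    T, F, K = V.shape
    # Anchor A_nw of the target speech in the original embedding space.
    A_nw, _, _ = second_stage_extract(X, V, anchor_avg_cw, forward_net_1, alpha)
    # Regular embedding vectors through forward network 2.
    merged = np.concatenate([V, np.broadcast_to(A_nw, (T, F, K))], axis=-1)
    V_reg = forward_net_2(merged)
    # Mask from the preset regular anchor; the masked spectrum is the
    # reconstructed target speech fed to the target word judgment module.
    M = 1.0 / (1.0 + np.exp(-(V_reg @ reg_anchor_preset)))
    return A_nw, X * M
```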
  • FIG. 10 is a schematic diagram of a test process of a voice recognition method in an embodiment of the present application, taking a target word as a wake-up word as an example for illustration, the method includes:
  • Step 1000 Enter the mixed voice.
  • Step 1001 The input mixed voice is passed through the target voice extraction module to recognize the target voice.
  • Step 1002 Input the target speech output by the target speech extraction module into the target word judgment module.
  • Step 1003 determine whether the target word is included, if yes, go to step 1004, otherwise, go to step 1005.
  • Step 1004 Adjust the preset adjustment parameters so that the weight of the preset anchor extraction feature in the calculated anchor extraction features of the target speech is reduced.
  • at this time, if it is determined that the target word is included, the device has entered the awake state; in the subsequent target speech extraction module, the corresponding target speech can be tracked according to the target word speech, and the anchor extraction feature of the target speech can be continuously adjusted;
  • based on the adjusted anchor extraction feature of the new target speech, the target command speech in the subsequent mixed command speech is recognized, thereby improving the accuracy of target speech recognition.
  • Step 1005 Adjust the preset adjustment parameters to increase the weight of the preset anchor extraction feature among the calculated anchor extraction features of the target speech.
  • at this time, if it is determined that the target word is not included, the device may not yet be in the awake state and the target word speech has not been detected; the calculated anchor extraction feature of the target speech may then be less reliable than the initial preset anchor extraction feature, so the preset anchor extraction feature is used for calculation as much as possible in subsequent calculations.
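The adjustment of the parameter in steps 1004 and 1005 can be sketched as follows; with the anchor written as `A_nw = alpha * estimated + (1 - alpha) * preset`, raising alpha lowers the weight of the preset anchor extraction feature and lowering alpha raises it (the step size and bounds are assumptions):

```python
def adjust_alpha(alpha, target_word_detected, step=0.1, lo=0.0, hi=1.0):
    """Dynamic adjustment of the preset adjustment parameter after each judgment."""
    if target_word_detected:
        return min(alpha + step, hi)   # step 1004: reduce the preset anchor's weight
    return max(alpha - step, lo)       # step 1005: increase the preset anchor's weight
```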
  • the voice recognition device in the embodiment of the present application specifically includes:
  • the first obtaining module 1100 is used for recognizing the target word speech from the mixed speech, obtaining the anchor extraction feature of the target word speech based on the target word speech, and using the anchor extraction feature of the target word speech as the anchor extraction feature of the target speech;
  • the second obtaining module 1110 is configured to extract features according to the anchor of the target voice and obtain a mask of the target voice;
  • the recognition module 1120 is configured to recognize the target voice according to the mask of the target voice.
  • the first obtaining module 1100 is specifically configured to: determine an embedding vector corresponding to each time-frequency window of the mixed speech; determine, according to the determined embedding vectors and the preset anchor extraction feature, the target word labeling information corresponding to each embedding vector; and obtain, based on the embedding vectors, the preset anchor extraction feature and the corresponding target word labeling information, the anchor extraction feature of the target word speech, using the anchor extraction feature of the target word speech as the anchor extraction feature of the target speech.
  • the second obtaining module 1110 is specifically configured to: obtain the regular embedding vector corresponding to the embedding vectors based on the extraction features of the embedding vectors and the anchor of the target speech; according to the regular embedding vectors Extract features with a preset regular anchor to obtain a mask of the target speech.
  • when determining the embedding vector corresponding to each time-frequency window of the mixed speech, the first obtaining module 1100 is specifically used to:
  • perform a short-time Fourier transform on the mixed speech to obtain the frequency spectrum of the mixed speech, and map the frequency spectrum of the mixed speech into an original embedding space of a fixed dimension to obtain an embedding vector corresponding to each time-frequency window of the mixed speech.
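A sketch of the embedding network implied by this step, using the sizes mentioned in the description (a 4-layer bidirectional LSTM with 600 units per layer and K = 40); the number of frequency bins, whether the 600 units are per direction, and the final linear projection are assumptions:

```python
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    """Maps the log-magnitude STFT of the mixed speech to a K-dimensional
    embedding vector for every time-frequency window."""

    def __init__(self, n_freq=129, hidden=600, layers=4, k=40):
        super().__init__()
        self.k = k
        self.blstm = nn.LSTM(n_freq, hidden, num_layers=layers,
                             bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, n_freq * k)

    def forward(self, log_spectrum):          # (B, T, F)
        h, _ = self.blstm(log_spectrum)       # (B, T, 2 * hidden)
        v = self.proj(h)                      # (B, T, F * K)
        b, t, _ = v.shape
        return v.view(b, t, -1, self.k)       # (B, T, F, K) embedding vectors
```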
  • when determining, according to the determined embedding vectors and the preset anchor extraction feature, the target word labeling information corresponding to each embedding vector, the first obtaining module 1100 is specifically configured to:
  • combine each embedding vector with the preset anchor extraction feature; input the combined vectors into the pre-trained first forward network; and obtain the target word labeling information corresponding to each embedding vector output by the first forward network, where the target word labeling information corresponding to an embedding vector that does not include the target word speech has a value of 0, and the target word labeling information corresponding to an embedding vector that includes the target word speech has a value of 1.
  • when obtaining the regular embedding vectors corresponding to the embedding vectors according to the embedding vectors and the anchor extraction feature of the target speech, the second obtaining module 1110 is specifically used to:
  • combine each embedding vector with the anchor extraction feature of the target speech to obtain combined 2K-dimensional vectors; input the combined 2K-dimensional vectors into the pre-trained second forward network; and, based on the second forward network, map each combined 2K-dimensional vector again into a regular embedding space of a fixed dimension to obtain the corresponding K-dimensional vector output by the second forward network, using the output K-dimensional vector as the regular embedding vector of the corresponding embedding vector, where the second forward network is used to map the original embedding space to the regular embedding space.
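The two forward networks referred to here could look like the following sketch; the description gives forward network 2 as a two-layer network with 256 nodes per layer mapping 2K to K dimensions, while the hidden size and output form of forward network 1 are assumptions:

```python
import torch.nn as nn

K = 40  # embedding dimension assumed as in the description

# Forward network 1 (sketch): scores the [embedding ; preset anchor]
# concatenation of one time-frequency window as target word speech or not.
forward_net_1 = nn.Sequential(
    nn.Linear(2 * K, 256), nn.ReLU(),
    nn.Linear(256, 1), nn.Sigmoid(),
)

# Forward network 2 (sketch): maps the 2K-dimensional concatenation of an
# embedding vector and the target speech anchor into the K-dimensional
# regular embedding space.
forward_net_2 = nn.Sequential(
    nn.Linear(2 * K, 256), nn.ReLU(),
    nn.Linear(256, K),
)
```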
  • the second obtaining module 1110 is specifically configured to: calculate the distance between each regular embedding vector and the preset regular anchor extraction feature, and obtain the mask of the target speech according to the value of each distance.
  • the voice recognition device further includes:
  • the adjustment module 1130 is used to input the recognized target speech into the pre-trained target word judgment module to determine whether the target speech includes the target word speech; if it is determined that the target word speech is included, the preset adjustment parameter is adjusted so that the weight of the preset anchor extraction feature in the calculated anchor extraction feature of the target speech is reduced; if it is determined that the target word speech is not included, the preset adjustment parameter is adjusted so that the weight of the preset anchor extraction feature in the calculated anchor extraction feature of the target speech is increased; and the target speech is recognized according to the adjusted anchor extraction feature of the target speech.
  • FIG. 12 it is a schematic structural diagram of a speech recognition model training device in an embodiment of the present application, where the speech recognition model includes a target speech extraction module and a target word judgment module, and the training device includes:
  • the obtaining module 1200 is used to obtain a voice sample set; wherein, the voice sample set is any one or a combination of the following: a clean target word voice sample set, a positive and negative sample set of the disturbed target word voice, and a disturbed command voice sample set ;
  • the training module 1210 is used to train the target speech extraction module, where the input of the target speech extraction module is the speech sample set and the output is the recognized target speech, and the objective function of the target speech extraction module is to minimize the loss function between the recognized target speech and the clean target speech; it is also used to train the target word judgment module, where the input of the target word judgment module is the target speech output by the target speech extraction module, the output is the target word judgment probability, and the objective function of the target word judgment module is to minimize the cross-entropy loss function of the target word judgment result.
  • if the voice sample set is: a clean target word voice sample set, and a positive and negative sample set of the disturbed target word speech or a disturbed command speech sample set, where the clean target word speech sample set at least includes the clean target word speech and the corresponding target word annotation information, the positive and negative sample set of the disturbed target word speech at least includes the disturbed target word speech and the corresponding target word annotation information, and the disturbed command speech sample set at least includes the disturbed command speech and the corresponding target word annotation information, then, when training the target speech extraction module, the training module 1210 is specifically used to:
  • the anchor extraction feature of the clean target word speech sample is obtained, and, according to the anchor extraction feature of each clean target word speech sample in the clean target word speech sample set, the average anchor extraction feature of the clean target word speech sample set is obtained;
  • the regular anchor extraction feature of the target speech is obtained, and according to each regular embedding vector and the regular anchor extraction feature of the target speech, a mask of the target speech is obtained;
  • the target speech is recognized from the disturbed target word speech sample or the disturbed command speech sample.
  • the training module 1210 is specifically used to:
  • the average anchor extraction feature of the clean target word speech sample set and the embedding vector of the disturbed target word speech sample determine the target word labeling information corresponding to each embedding vector of the disturbed target word speech sample;
  • the target speech is recognized from the disturbed target word speech sample or the disturbed command speech sample.
  • the training module 1210 is specifically used to train the target voice extraction module:
  • the average anchor extraction feature of the clean target word speech sample set and the embedding vector of the disturbed target word speech sample determine the target word labeling information corresponding to each embedding vector of the disturbed target word speech sample;
  • the regular anchor extraction feature of the target speech is obtained, and according to each regular embedding vector and the regular anchor extraction feature of the target speech, a mask of the target speech is obtained;
  • the target speech is recognized from the disturbed target word speech sample or the disturbed command speech sample.
  • the preset anchor extraction feature is an average anchor extraction feature of the clean target word speech sample set obtained through pre-training
  • the preset regular anchor extraction feature is the average value of the regular anchor extraction feature of the target speech of the disturbed target word speech or the disturbed command speech sample set obtained through pre-training.
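A minimal sketch of how these two preset features would be assembled from the per-sample anchors collected during pre-training (the list arguments and the simple mean are assumptions consistent with the centroid description):

```python
import numpy as np

def preset_anchors(clean_anchor_list, regular_anchor_list):
    """Each list holds one (K,) anchor per training sample / speaker."""
    anchor_avg_cw = np.mean(np.stack(clean_anchor_list), axis=0)        # preset anchor
    reg_anchor_preset = np.mean(np.stack(regular_anchor_list), axis=0)  # preset regular anchor
    return anchor_avg_cw, reg_anchor_preset
```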
  • FIG. 13 it is a schematic structural diagram of an electronic device according to an embodiment of the present application.
  • the electronic device may include a processor 1310 (Central Processing Unit, CPU), a memory 1320, an input device 1330, an output device 1340, and the like.
  • the input device 1330 may include a keyboard, a mouse, and a touch screen.
  • the output device 1340 may include a display device, such as a liquid crystal display (Liquid Crystal Display, LCD), a cathode ray tube (Cathode Ray Tube, CRT), and so on.
  • the electronic device may be a terminal (such as an intelligent terminal) or a server.
  • the memory 1320 may include a read-only memory (ROM) and a random access memory (RAM), and provides the processor 1310 with computer-readable program instructions and data stored in the memory 1320.
  • the memory 1320 may be used to store program instructions of the voice recognition method in the embodiment of the present application.
  • the processor 1310 may call the computer-readable program instructions stored in the memory 1320, and execute any one of the speech recognition methods and any one of the speech recognition model training methods according to the obtained program instructions.
  • the embodiments of the present application take the portable multi-function device 1400 including a touch screen as an example for description.
  • Those skilled in the art can understand that the embodiments in this application are also applicable to other devices, such as handheld devices, vehicle-mounted devices, wearable devices, computing devices, and various forms of user equipment (User Equipment, UE), mobile stations (Mobile station, MS), terminal (terminal), terminal equipment (Terminal Equipment) and so on.
  • the device 1400 may include an input unit 1430, a display unit 1440, a gravity acceleration sensor 1451, a proximity light sensor 1452, an ambient light sensor 1453, a memory 1420, a processor 1490, a radio frequency unit 1410, an audio circuit 1460, a speaker 1461, a microphone 1462, WiFi (wireless fidelity, wireless fidelity) module 1470, Bluetooth module 1480, power supply 1493, external interface 1497 and other components.
  • FIG. 14 is only an example of a portable multi-function device, and does not constitute a limitation on the portable multi-function device.
  • the device may include more or fewer parts than shown, combine certain parts, or use different parts.
  • the input unit 1430 may be used to receive input numeric or character information, and generate key signal input related to user settings and function control of the portable multi-function device.
  • the input unit 1430 may include a touch screen 1431 and other input devices 1432.
  • the touch screen 1431 can collect the user's touch operations on or near it (such as operations performed by the user on or near the touch screen using a finger, a joint, a stylus, or any other suitable object) and drive the corresponding connection device according to a preset program.
  • the touch screen can detect the user's touch action on the touch screen, convert the touch action into a touch signal and send it to the processor 1490, and can receive and execute commands sent by the processor 1490; the touch signal includes at least touch point coordinate information.
  • the touch screen 1431 may provide an input interface and an output interface between the device 1400 and the user.
  • the touch screen can be implemented in various types such as resistive, capacitive, infrared, and surface acoustic waves.
  • the input unit 1430 may include other input devices.
  • other input devices 1432 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), trackball, mouse, joystick, and so on.
  • the display unit 1440 can be used to display information input by the user or information provided to the user and various menus of the device 1400.
  • the touch screen 1431 may cover the display panel; when the touch screen 1431 detects a touch operation on or near it, the operation is transmitted to the processor 1490 to determine the type of touch event, and the processor 1490 then provides corresponding visual output on the display panel according to the type of touch event.
  • the touch screen and the display unit can be integrated into one component to realize the input, output, and display functions of the device 1400; for ease of description, the embodiments of the present application use the touch screen to represent the combined functions of the touch screen and the display unit; in some embodiments, the touch screen and the display unit can also be used as two independent components.
  • the gravity acceleration sensor 1451 can detect the magnitude of acceleration in various directions (generally three axes), and at the same time, the gravity acceleration sensor 1451 can also be used to detect the magnitude and direction of gravity when the terminal is at rest, and can be used to identify mobile phone gesture applications (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer, tap) and so on.
  • the device 1400 may further include one or more proximity light sensors 1452 for turning off and disabling the touch screen when the device 1400 is close to the user (for example, near the ear when the user is making a phone call), so as to avoid accidental operation of the touch screen by the user;
  • the device 1400 may also include one or more ambient light sensors 1453 for keeping the touch screen off when the device 1400 is located in the user's pocket or another dark area, to prevent the device 1400 from consuming unnecessary battery power or being operated by mistake when locked; in some embodiments, the proximity light sensor and the ambient light sensor may be integrated in one component or used as two independent components.
  • although FIG. 14 shows the proximity light sensor and the ambient light sensor, it can be understood that they are not necessary components of the device 1400 and can be omitted as needed without changing the essential scope of the application.
  • the memory 1420 may be used to store instructions and data.
  • the memory 1420 may mainly include a storage instruction area and a storage data area.
  • the storage data area may store the association relationship between joint touch gestures and application functions;
  • the storage instruction area may store an operating system, instructions required for at least one function, and the like;
  • the instructions may cause the processor 1490 to execute the speech recognition method in the embodiment of the present application.
  • the processor 1490 is the control center of the device 1400 and connects the various parts of the entire mobile phone through various interfaces and lines; by running or executing the instructions stored in the memory 1420 and calling the data stored in the memory 1420, it executes the various functions of the device 1400 and processes data, so as to monitor the mobile phone as a whole.
  • the processor 1490 may include one or more processing units; preferably, the processor 1490 may integrate an application processor and a modem processor, where the application processor mainly processes the operating system, user interface, and application programs, etc.
  • the modem processor mainly handles wireless communication. It can be understood that the foregoing modem processor may not be integrated into the processor 1490.
  • the processor and the memory can be implemented on a single chip. In some embodiments, they can also be implemented on separate chips.
  • the processor 1490 is also used to call instructions in the memory to implement the speech recognition method in the embodiment of the present application.
  • the radio frequency unit 1410 can be used to receive and send information, or to receive and send signals during a call; in particular, after receiving downlink information from the base station, it passes the information to the processor 1490 for processing, and it sends uplink data to the base station.
  • the RF circuit includes but is not limited to an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like.
  • the radio frequency unit 1410 can also communicate with network devices and other devices through wireless communication.
  • the wireless communication may use any communication standard or protocol, including but not limited to Global System of Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), E-mail, Short Message Service (SMS), and so on.
  • the audio circuit 1460, the speaker 1461, and the microphone 1462 may provide an audio interface between the user and the device 1400.
  • the audio circuit 1460 can transmit the electrical signal converted from the received audio data to the speaker 1461, and the speaker 1461 converts it into a sound signal for output; on the other hand, the microphone 1462 converts the collected sound signal into an electrical signal, which is received by the audio circuit 1460 and converted into audio data; after the audio data is processed by the processor 1490, it is sent to another terminal via the radio frequency unit 1410, or the audio data is output to the memory 1420 for further processing.
  • the audio circuit may also include a headphone jack 1463, used to provide a connection interface between the audio circuit and the headset.
  • WiFi is a short-range wireless transmission technology.
  • the device 1400 can help users send and receive e-mails, browse web pages, and access streaming media through the WiFi module 1470. It provides users with wireless broadband Internet access.
  • FIG. 14 shows the WiFi module 1470, it can be understood that it is not a necessary component of the device 1400, and can be omitted without changing the scope of the essence of the application as needed.
  • Bluetooth is a short-range wireless communication technology.
  • the use of Bluetooth technology can effectively simplify the communication between mobile communication terminal devices such as handheld computers, notebook computers, and mobile phones, and can also successfully simplify the communication between these devices and the Internet.
  • through the Bluetooth module 1480, the device 1400 makes the data transmission between the device 1400 and the Internet faster and more efficient, broadening the road for wireless communication.
  • Bluetooth technology is an open solution that enables wireless transmission of voice and data.
  • although FIG. 14 shows the Bluetooth module 1480, it can be understood that it is not a necessary component of the device 1400 and can be omitted as needed without changing the essential scope of the application.
  • the device 1400 further includes a power supply 1493 (such as a battery) that supplies power to various components.
  • the power supply can be logically connected to the processor 1490 through the power management system 1494, so that functions such as charging, discharging, and power consumption management are implemented through the power management system 1494.
  • the device 1400 also includes an external interface 1497, which may be a standard Micro USB interface or a multi-pin connector, and which may be used to connect the device 1400 to communicate with other devices, or to connect a charger to charge the device 1400.
  • the device 1400 may further include a camera, a flash, and the like, and details are not described herein again.
  • a computer-readable storage medium is provided, on which computer-readable program instructions are stored, and the computer-readable program instructions are executed by a processor to implement the speech recognition method and the speech recognition model training method of any of the above method embodiments.
  • the embodiments of the present application may be provided as methods, systems, or computer program products; therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware; moreover, the present application may take the form of a computer program product implemented on one or more volatile or non-volatile computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
  • these computer program instructions may also be stored in a computer-readable memory that can guide a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • these computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Machine Translation (AREA)
  • Image Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A speech recognition method, a speech recognition model training method, and corresponding apparatuses. The speech recognition method includes: recognizing a target word speech from a mixed speech, obtaining an anchor extraction feature of the target word speech based on the target word speech, and using the anchor extraction feature of the target word speech as the anchor extraction feature of a target speech (100); obtaining a mask of the target speech according to the anchor extraction feature of the target speech (110); and recognizing the target speech according to the mask of the target speech (120).

Description

一种语音识别、及语音识别模型训练方法及装置
本申请要求于2018年10月25日提交中国专利局、申请号为201811251081.7、申请名称为“一种语音识别、及语音识别模型训练方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机技术领域,尤其涉及一种语音识别、及语音识别模型训练方法及装置。
背景技术
人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个综合技术,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。
人工智能技术是一门综合学科,涉及领域广泛,既有硬件层面的技术也有软件层面的技术。人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、语音处理技术、自然语言处理技术以及机器学习/深度学习等几大方向。
语音技术(Speech Technology)的关键技术有自动语音识别技术(ASR)和语音合成技术(TTS)以及声纹识别技术。让计算机能听、能看、能说、能感觉,是未来人机交互的发展方向,其中语音成为未来最被看好的人机交互方式之一。
在智能语音交互场景中,尤其是远讲条件下,通常会出现不同说话人的语音混叠的情况,目前针对混合语音中目标说话人的语音提取的研究越来越受到重视。
现有技术中,语音识别方法主要为,采用深度吸引子网络,为混合语音中每个说话人的语音生成一个吸引子,然后通过计算嵌入向量离这些吸引子的距离,来估计每个吸引子对应的时频窗口归属于相应说话人的掩码(mask)权重,从而根据掩码权重,将各个说话人的语音区分开。
但是,现有技术中的语音识别方法,需要预先知道或估计混合语音中说话人的数目,从而将不同说话人的语音区分开来,但是现有技术中不能跟踪和提取某一目标说话人的语音。
发明内容
本申请实施例提供一种语音识别、及语音识别模型训练方法及装置、电子设备及存储介质,以解决现有技术中语音识别准确性较低,并且不能跟踪和识别某一目标说话人的语音的问题。
本申请实施例提供的具体技术方案如下:
本申请一个实施例提供了一种语音识别方法,由电子设备执行,包括:
从混合语音中识别出目标词语音,并基于所述目标词语音获得目标词语音的锚提取特征,将所述目标词语音的锚提取特征作为目标语音的锚提取特征;
根据所述目标语音的锚提取特征,获得所述目标语音的掩码;
根据所述目标语音的掩码,识别出所述目标语音。
本申请另一个实施例提供了一种语音识别模型训练方法,由电子设备执行,所述语音识别模型包括目标语音提取模块和目标词判断模块,该方法包括:
获取语音样本集;其中,所述语音样本集为以下任意一种或组合:干净目标词语音样本集、受干扰目标词语音的正负样本集、受干扰命令语音样本集;
训练目标语音提取模块,其中,所述目标语音提取模块的输入为所述语音样本集,输出为识别出的目标语音,所述目标语音提取模块的目标函数为识别出的目标语音与干净目标语音之间的损失函数最小化;
训练目标词判断模块,其中,所述目标词判断模块的输入为所述目标语音提取模块输出的目标语音,输出为目标词判断概率,所述目标词判断模块的目标函数为目标词判断结果的交叉熵损失函数最小化。
本申请另一个实施例提供了一种语音识别装置,包括:
第一获得模块,用于从混合语音中识别出目标词语音,并基于所述目标词语音获得目标词语音的锚提取特征,将所述目标词语音的锚提取特征作为目标语音的锚提取特征;
第二获得模块,用于根据所述目标语音的锚提取特征,获得所述目标语音的掩码;
识别模块,用于根据所述目标语音的掩码,识别出所述目标语音。
本申请另一个实施例提供了一种语音识别模型训练装置,所述语音识别模型包括目标语音提取模块和目标词判断模块,该装置包括:
获取模块,用于获取语音样本集;其中,所述语音样本集为以下任意一种或组合:干净目标词语音样本集、受干扰目标词语音的正负样本集、受干扰命令语音样本集;
训练模块,用于训练目标语音提取模块,其中,所述目标语音提取模块的输入为所述语音样本集,输出为识别出的目标语音,所述目标语音提取模块的目标函数为识别出的目标语音与干净目标语音之间的损失函数最小化;并训练目标词判断模 块,其中,所述目标词判断模块的输入为所述目标语音提取模块输出的目标语音,输出为目标词判断概率,所述目标词判断模块的目标函数为目标词判断结果的交叉熵损失函数最小化。
本申请另一个实施例提供了一种电子设备,包括:
至少一个存储器,用于存储计算机可读程序指令;
至少一个处理器,用于调用所述存储器中存储的计算机可读程序指令,按照获得的计算机可读程序指令执行上述任一种语音识别方法或者语音识别模型训练方法。
本申请另一个实施例提供了一种计算机可读存储介质,其上存储有计算机可读程序指令,所述计算机可读程序指令被处理器加载并执行上述任一种语音识别方法或者语音识别模型训练方法。
本申请实施例中,根据混合语音中的目标词语音,来确定目标词语音对应的目标语音的锚提取特征,从而根据目标语音的锚提取特征,得到目标语音的掩码并根据掩码识别出目标语音。进一步地,可以根据目标词来识别和跟踪特定的目标语音,不需要预先知道或估计混合语音中说话人的数目,只需计算目标语音的锚提取特征即可,提高了语音识别的准确性及效率。
附图简要说明
图1为本申请实施例中一种语音识别方法的流程图;
图2为本申请实施例中另一种语音识别方法的流程图;
图3为本申请实施例中语音识别系统的框架图;
图4为本申请实施例中目标语音提取模块的实现方案的结构框架图;
图5为本申请实施例中目标词判断模块的实现方案的结构框架图;
图6为本申请实施例中基于干净目标词语音的训练方案的结构框架图;
图7为本申请实施例中基于受干扰目标词语音在原始嵌入空间的训练方案的结构框架图;
图8为本申请实施例中基于受干扰目标词语音在规整嵌入空间的训练方案的结构框架图;
图9为本申请实施例中语音识别方法的测试方案的结构框架图;
图10为本申请实施例中语音识别方法的测试流程的示意图;
图11为本申请实施例中语音识别装置的结构示意图;
图12为本申请实施例中语音识别模型训练装置的结构示意图;
图13为本申请实施例中一种电子设备的结构示意图;
图14为本申请实施例中终端的结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,并不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
随着人工智能技术研究和进步,人工智能技术在多个领域展开研究和应用,例如常见的智能家居、智能穿戴设备、虚拟助理、智能音箱、智能营销、无人驾驶、自动驾驶、无人机、机器人、智能医疗、智能客服、语音识别等,相信随着技术的发展,人工智能技术将在更多的领域得到应用,并发挥越来越重要的价值。
为便于对本申请实施例的理解,下面先对几个概念进行简单介绍:
唤醒词:表示唤醒人工智能(Artificial Intelligence,AI)设备的词语,使AI设备处于唤醒状态。
嵌入向量:本申请实施例中表示语音信号映射到某一维度嵌入空间中的固定长度的向量表示。
规整嵌入向量:本申请实施例中表示经过两次嵌入空间映射后的向量表示。
锚提取特征:为一种语音信号的语音特征表示。
掩码(mask):信号掩码可以理解为一个"位图",其中每一位都对应着一种信号,可以用于屏蔽相应的信号。
另外,本申请实施例中,使用上标“n(noisy)”表示受干扰语音,“c(clean wakeup-word)”表示干净语音;“nw(noisy wakeup-word)”表示受干扰目标词语音,“cw(clean wakeup-word)”表示干净目标词语音;“nc(noisy command)”表示受干扰命令语音,“cc(clean command)”表示干净命令语音。输入频谱X f,t是对数域的短时傅立叶变换,f表示频谱维度的序列号,t表示时间维度的帧序列号。
在本申请的一种技术方案中,针对混合语音识别,需要预先知道或估计混合语音中说话人的数目,从而将不同说话人的语音区分开,但是这种技术方案不能直接跟踪或识别某一特定目标说话人的语音,也不能针对性地提取混合语音中目标说话人的语音。
并且,该技术方案中针对每个说话人计算的吸引子,采用单层嵌入空间进行训练学习,得到的吸引子分布相对松散不稳定,降低了语音识别的准确性。在本申请的另一种技术方案中还提供了一种后期(K均值)K-means聚类方法,可以使获得的吸引子分布相对集中,但是这种技术方案需要使用多帧语音信号聚类,因此不能支持逐帧实时处理,降低了语音识别的效率。
通常在人机交互场景中,例如,智能音响、智能电视盒子的人机交互场景等,每次交互都是由目标说话人的一个目标词接一个命令语音输入构成,因此,本申请主要在于结合目标词进行多任务训练,基于目标词确定目标语音特征。本申请实施例中认为说出目标词的人为目标说话人,该目标词语音的特征即为目标语音特征, 从而通过识别目标词来确定和跟踪目标语音,并提取后续接收到的受干扰命令语音,即混合语音中的目标语音,而不需要预知混合语音中说话人的数目。这样,不仅能够有效识别和跟踪某一特定目标语音,而且本申请实施例中采用双层嵌入空间进行计算和提取,得到的目标语音特征即锚提取特征更加集中和稳定,从而使得对混合语音中目标语音的识别和提取的准确性更高。
本申请实例中的语音识别方法可以由智能终端执行,也可以在智能终端接收到混合语音后,发送给服务器,并由服务器进行语音识别,并将语音识别结果发送给智能终端。智能终端与服务器之间可以通过互联网相连,实现相互之间的通信。服务器可以是提供相应网络服务的后台服务器。对于由哪种设备执行语音识别方法,本申请实施例中并不进行限制。
并且,本申请实施例中,主要是针对目标词语音的训练和对目标语音的学习,例如目标词语音为唤醒词语音。当然,也可以基于其它适应语音或关键词进行训练和对目标语音进行学习和识别,本申请实例中并不进行限制。本申请实施例主要是以唤醒词为例进行说明和介绍的。
参阅图1所示,为本申请实施例中的语音识别方法的流程图,该方法包括:
步骤100:从混合语音中识别出目标词语音,并基于目标词语音获得目标词语音的锚提取特征,将目标词语音的锚提取特征作为目标语音的锚提取特征。
执行步骤100时,具体包括步骤a1~a2:
步骤a1,从混合语音中识别出目标词语音。
具体为,确定混合语音的每个时频窗口对应的嵌入向量;根据确定的各嵌入向量和预设锚提取特征,确定各嵌入向量分别对应的目标词标注信息。
步骤a2,基于目标词语音获得目标词语音的锚提取特征,将目标词语音的锚提取特征作为目标语音的锚提取特征。
具体为,根据各嵌入向量、预设锚提取特征和对应的目标词标注信息,获得目标语音的锚提取特征。
步骤110:根据目标语音的锚提取特征,获得目标语音的掩码。
步骤120:根据目标语音的掩码,识别出目标语音。
这样,本申请实施例中,在混合语音中识别出目标词语音,学习目标词语音的语音特征,将目标词语音的语音特征作为目标语音的语音特征,即得到目标语音的锚提取特征,进而就可以根据目标语音的锚提取特征,计算目标语音的掩码,并识别出目标语音。
具体地,参阅图2所示,为本申请实施例中另一种语音识别方法的流程图,该方法包括:
步骤200:确定混合语音的每个时频窗口对应的嵌入向量。
执行步骤200时,具体包括步骤b1~b2:
步骤b1,对混合语音进行短时傅里叶变换,获得混合语音的频谱。
其中,短时傅里叶变换的主要原理是将信号加滑动时间窗,并对窗内信号做傅立叶变换,得到信号的时变频谱。
步骤b2,基于预先训练的深度神经网络,将混合语音的频谱映射到固定维度的原始嵌入空间中,获得混合语音的每个时频窗口对应的嵌入向量。
例如,经过短时傅里叶变换后的混合语音的频谱为X f,t,经过深度神经网络映射为K维嵌入空间的嵌入向量V f,t,例如K=40,其中,f表示频谱维度的序列号,t表示时间维度的帧序列号。
其中,深度神经网络例如为长短期记忆网络(Long Short-Term Memory,LSTM),本申请实施例中并不进行限制。例如深度神经网络由4层双向LSTM层构成,每层LSTM有600个节点,具体的参数设置可以根据实际情况进行设置和调整。当然,本申请实施例中并不具体限定深度神经网络的模型类型和拓扑结构,其也可以为各种其它有效的新型的模型结构,例如,卷积网络(Convolutional Neural Network,CNN)和其它网络结构相结合的模型,或者其它网络结构,例如时延网络、闸控卷积神经网络等。本申请实施例中,可以根据实际应用对模型内存占用的限制和对检测准确率的要求,对深度神经网络的拓扑结构加以拓展或简化。
本申请实施例中,嵌入向量表示语音信号映射到某一维度空间中的固定长度的向量表示,嵌入向量V f,t∈R k
步骤210:根据确定的各嵌入向量和预设锚提取特征,确定各嵌入向量分别对应的目标词标注信息,并根据各嵌入向量、预设锚提取特征和对应的目标词标注信息,获得目标语音的锚提取特征。
执行步骤210时,具体包括步骤c1~c2:
步骤c1,根据确定的各嵌入向量和预设锚提取特征,确定各嵌入向量分别对应的目标词标注信息。
具体为:分别将各嵌入向量和预设锚提取特征进行合并;将各合并后的向量输入到预先训练的第一前向网络;获得第一前向网络对各合并后的向量进行识别后输出的各嵌入向量对应的目标词标注信息,其中,不包括目标词语音的嵌入向量对应的目标词标注信息取值为0,包括目标词语音的嵌入向量对应的目标词标注信息取值为1。
例如,各嵌入向量为V f,t,预设锚提取特征为
Figure PCTCN2019111905-appb-000001
将V f,t
Figure PCTCN2019111905-appb-000002
合并为2K维向量,输入到第一前向网络中,并预测对应的目标词标注信息,记为Y f,t,从而可以获得各嵌入向量是否属于目标语音的标注信息。
这样,通过估计混合语音中的目标词标记信息,就可以从混合语音中识别出目标词语音。
其中,预设锚提取特征为通过预先训练获得的干净目标词语音样本集中各用户的干净目标词语音样本对应的锚提取特征的质心的平均值,即为通过预先训练获得的干净目标词语音样本集的平均锚提取特征,具体的预设锚提取特征的训练将在下 文再进行具体介绍。
这样,本申请实施例中在使用该语音识别方法时,不需要重新估计锚提取特征,也不需要聚类,因此,可以支持逐帧实时处理。并且,由于本申请实施例中通过训练获得的锚提取特征更加集中和稳定,因此在语音识别应用中使用的锚提取特征也就更加准确,从而使得后续的目标语音的锚提取特征的计算更加准确,也就提高了目标语音识别和提取的准确性。
步骤c2,根据各嵌入向量、预设锚提取特征和对应的目标词标注信息,获得目标语音的锚提取特征。
具体地,根据各嵌入向量、预设锚提取特征和对应的目标词标注信息,获得目标词语音的锚提取特征,将目标词语音的锚提取特征作为目标语音的锚提取特征。这里为了方便描述,本实施例将获得锚提取特征描述为直接获得目标语音的锚提取特征。为了便于介绍和叙述,在下文中的相关描述中,也描述为直接获得目标语音的锚提取特征。
本申请实施例中,根据各嵌入向量、预设锚提取特征和对应的目标词标注信息,计算得到的实际为目标词语音的锚提取特征。由于目标语音与目标词语音的语音特征相符合,因此本申请实施例可以通过目标词语音学习并跟踪目标语音。因此,本申请实施例可以将目标词语音的锚提取特征作为目标语音的锚提取特征。
例如,各嵌入向量为V f,t,预设锚提取特征为
Figure PCTCN2019111905-appb-000003
目标词标注信息为Y f,t,目标语音的锚提取特征为A nw,则
Figure PCTCN2019111905-appb-000004
其中,α为调节参数,α越大,则说明计算出的锚提取特征越偏向于估计的目标语音的锚提取特征,反之α越小,则说明锚提取特征越偏向于预设锚提取特征。
这样,本申请实施例中还可以通过调整α的取值,更新该目标语音的锚提取特征,从而提高目标语音的锚提取特征的准确性。
步骤220:根据各嵌入向量和目标语音的锚提取特征,获得各嵌入向量对应的规整嵌入向量,并根据各规整嵌入向量和预设规整锚提取特征,获得目标语音的掩码。
执行步骤220时,具体包括步骤d1~d2:
步骤d1,根据各嵌入向量和目标语音的锚提取特征,获得各嵌入向量对应的规整嵌入向量。
具体为:1)分别将各嵌入向量和目标语音的锚提取特征进行合并,获得各合并后的2K维向量;其中,嵌入向量和目标语音的锚提取特征分别为K维向量。
2)将各合并后的2K维向量输入到预先训练的第二前向网络。
3)基于第二前向网络,将各合并后的2K维向量再次映射到固定维度的规整嵌入空间中,获得第二前向网络输出的相应的K维向量,并将输出的K维向量作为相 应的嵌入向量的规整嵌入向量;其中,第二前向网络用于将原始嵌入空间映射到规整嵌入空间。
例如,各嵌入向量为V f,t,目标语音的锚提取特征为A nw,则获得的各规整嵌入向量为:
Figure PCTCN2019111905-appb-000005
其中,f(·)表示通过深度神经网络学习到的非线性映射函数,其作用是将原始嵌入空间映射到新的规整嵌入空间。
其中,第二前向网络的参数也可以根据实际情况进行设置,例如设置为2层的前向网络,每层的节点数是256,输入为2K维向量,输出为K维向量。当然,也可以根据实际应用对模型内存占用的限制和对检测准确率的要求,对前向网络的拓扑结构加以拓展或简化,本申请实施例中并不对此进行限制。
本申请实施例中,规整嵌入向量表示经过两次嵌入空间映射后的向量,并且第一次映射基于混合语音频谱,第二次映射基于第一次映射后的嵌入向量和计算出的目标语音的锚提取特征。
这样,本申请实施例中将混合语音经过两次嵌入空间映射,即基于双层嵌入空间,最终将混合语音映射到规整嵌入空间,从而可以实现在规整嵌入空间,根据目标语音的规整锚提取特征,对目标语音的掩码进行计算。通过规整可以减少干扰影响,使得目标语音的规整锚提取特征的分布更加集中和稳定,从而提高了识别出的目标语音的准确性。
步骤d2,根据各规整嵌入向量和预设规整锚提取特征,获得目标语音的掩码。
其中,预设规整锚提取特征表示通过预先训练获得的干扰语音样本集中各用户的干扰语音样本对应的规整锚提取特征的质心的平均值,即通过预先训练获得受干扰目标词语音的正负样本集或受干扰命令语音样本集的目标语音的规整锚提取特征的平均值。预设规整锚提取特征的训练将在下文再进行具体介绍。
具体为:分别计算各规整嵌入向量和预设规整锚提取特征之间的距离,根据各距离的取值获得目标语音的掩码。
进一步地,将各距离的取值映射到[0,1]范围内,并根据映射后的各距离的取值构成目标语音的掩码。
例如,预设规整锚提取特征为
Figure PCTCN2019111905-appb-000006
各规整嵌入向量为
Figure PCTCN2019111905-appb-000007
则计算的目标语音的掩码(mask)为:
Figure PCTCN2019111905-appb-000008
其中,Sigmoid为S型函数,用于将变量映射到[0,1]之间,即用于将本申请实施例中的各距离的取值映射到[0,1]范围内,这样是为了便于后续目标语音的提取。
步骤230:根据目标语音的掩码,识别出目标语音。
例如,混合语音的频谱为X f,t,目标语音的掩码为
Figure PCTCN2019111905-appb-000009
则识别出的目标语音为:
Figure PCTCN2019111905-appb-000010
由于目标语音的掩码是根据各规整嵌入向量与目标语音的规整锚提取特征的 内积计算得到的,因此内积取值越大,说明规整嵌入向量与目标语音的规整锚提取特征的距离越小,该时频窗口归属于目标语音的概率越大,则计算出的掩码对应该时频窗口的值越大,计算出对应的
Figure PCTCN2019111905-appb-000011
的值也越大,表示该时频窗口被提取的越多,从而最终计算出的目标语音也就越接近于实际的目标说话人的语音。
本申请实施例中,识别目标语音时,可以是从当前输入的混合语音中识别,也可以在设备处于唤醒状态后,从后续接收到的混合命令语音中识别出目标语音,本申请实施例中语音识别方法都是可以适用的。
进一步地,本申请实施例中在识别出目标语音后,还可以动态调整目标语音的锚提取特征。例如,若目标词为唤醒词,识别出唤醒词语音并对设备进行唤醒后,识别在设备唤醒状态中混合语音中的目标语音,从而提高在整个设备唤醒状态中,对目标语音识别的准确性。具体地,本申请实施例提供了一种可能的实施方式,将识别出的目标语音输入到预先训练的目标词判断模块,判断目标语音中是否包括目标词语音,然后根据判断结果,调整目标语音的锚提取特征,并根据调整后的目标语音的锚提取特征,识别目标语音。
其中,根据判断结果,调整目标语音的锚提取特征,具体为:若判断结果为目标语音中包括目标词语音,则调整预设调节参数,使计算出的目标语音的锚提取特征中的预设锚提取特征的权重减小;若判断结果为目标语音中不包括目标词语音,则调整预设调节参数,使计算出的目标语音的锚提取特征中的预设锚提取特征的权重增加。
具体地,可以调整上述目标语音的锚提取特征
Figure PCTCN2019111905-appb-000012
中的α的取值。在该目标语音的锚提取特征的计算方式中,若判断目标语音中包括目标词语音,则说明估计的目标语音接近于实际的目标语音,可以调大α的取值,使预设锚提取特征的权重减小,估计出的目标语音的锚提取特征的权重增大;若判断目标语音中不包括目标词语音,则说明估计的目标语音不准确,可以调小α的取值,使预设锚提取特征的权重增加,估计出的目标语音的锚提取特征的权重减小。
由于目标语音的锚提取特征是基于估计出的目标词标注信息计算得到的,因此目标词语音的识别也是估计得到的。因为目标词语音识别,即目标词标注信息可能会出现误差,因此可能会降低目标语音的锚提取特征的准确性。因此,本申请实施例中,若目标词语音识别正确,则在智能终端未处于唤醒状态时,识别出的目标语音必定会包括目标词语音,有时可能也会包括命令语音,例如,用户可能同时说出目标词和命令指示的场景。因此,对识别出的目标语音再进行目标词判断,即判断识别出的目标语音是否包括目标词语音,还可以提高目标词识别的准确性。若确定目标语音中包括目标词语音,则可以确定之前的目标词标注信息是正确的,根据目标词标注信息得到的目标语音的锚提取特征也是准确的,因此,可以调大α的取值,使预设锚提取特征的权重减小,估计出的目标语音的锚提取特征的权重增大。并且, 由于确定目标语音中包括目标词语音,智能终端进入唤醒状态之后,就可以基于调整后的目标语音的锚提取特征,从之后的混合命令语音中识别出目标语音,从而提取出的目标语音更加准确。
本申请实施例中,通过对最后重建出的目标语音进行目标词判断,并根据判断结果来调整α的取值,从而可以动态调整目标语音的锚提取特征。这样,在对设备唤醒状态之后接收到的混合语音中的目标语音进行识别时,可以基于调整后的目标语音的锚提取特征进行,从而可以提高目标语音识别的准确性。
本申请实施例中,智能终端在唤醒时和处于唤醒状态后,可以基于调整后的目标语音的锚提取特征来识别目标语音。智能终端再次进入休眠状态后,则会将调整后的目标语音的锚提取特征恢复为初始的预设锚提取特征,然后重新计算目标语音的锚提取特征,并可以再次调整该计算出的目标语音的锚提取特征。
本申请实施例中的语音识别方法可以应用于智能音箱、智能电视盒子、在线语音交互系统、智能语音助手、车载智能语音设备、同声传译等多个项目和产品应用中。本申请实施例中的语音识别方法可以应用于各远场人机语音交互场景,并可以对目标词语音和目标语音的锚提取特征进行优化和训练,从而在应用时可以根据目标词语音来确定目标语音的锚提取特征,并识别出目标语音,而不需要预先知道或估计混合语音中说话人的数目。并且,本申请实施例中的语音识别方法可以适用于目标词语音或其它关键词语音长度非常短的情况,也可以有效地跟踪目标语音并学习其语音特征,其适用范围更广。本申请实施例中,通过规整计算,可以消除干扰影响,而且经过规整后的锚提取特征具有更加稳定和集中的优势。因此,在实际应用时,可以基于学习到的预设锚提取特征和预设规整锚提取特征,对混合语音进行逐帧实时处理,重建目标说话人的语音。通过本申请实施例,可以重建得到高质量的目标说话人语音,提高了重建出的目标语音的信号失真比(SDR)和主观语音质量评估(PESQ)指标等性能,显著地改善了唤醒和自动语音识别系统的准确率。
基于上述实施例,下面对本申请实施例中语音识别的训练过程进行具体说明。
通常训练过程是在后台服务器执行。由于各个模块的训练可能比较复杂,计算量较大,因此,可以由后台服务器实现训练过程,从而可以将训练好的模型和结果应用到各个智能终端,实现语音识别。
参阅图3所示,为本申请实施例中语音识别系统的框架图。本申请实施例中,语音识别训练主要包括2大任务。第一个任务为重建目标说话人的干净语音,即目标语音提取模块,用于通过训练获得基于目标词的目标语音的锚提取特征,并从混合语音中识别出目标语音;第二个任务为目标词判断,即目标词判断模块,用于对重建出的目标语音,判断其是否包括目标词语音,从而提高目标词标注信息的准确性。本申请实施例中,提供了一种语音识别模块的训练方法,具体为:
步骤f1,获取语音样本集。其中,语音样本集为以下任意一种或组合:干净目标词语音样本集、受干扰目标词语音的正负样本集、受干扰命令语音样本集。
步骤f2,训练目标语音提取模块。其中,目标语音提取模块的输入为语音样本集,输出为识别出的目标语音,目标语音提取模块的目标函数为识别出的目标语音与干净目标语音之间的损失函数最小化。
步骤f3,训练目标词判断模块。其中,目标词判断模块的输入为目标语音提取模块输出的目标语音,输出为目标词判断概率,目标词判断模块的目标函数为目标词判断结果的交叉熵损失函数最小化。
本申请实施例中,主要通过对目标语音提取模块的训练和目标词判断模块的训练,可以同时优化识别目标词语音的准确性和目标语音的锚提取特征的准确性,从而可以根据目标词语音特征来提高识别目标词语音特征对应的目标语音的准确性。本申请实施例对步骤f2和f3的执行顺序没有限制。
基于上述实施例中的图3,可知本申请实施例中的语音识别训练模型主要包括目标语音提取模块和目标词判断模块两部分,下面分别进行介绍。
第一部分:目标语音提取模块。
参阅图4所示,为本申请实施例中的目标语音提取模块的实现方案的结构框架图。本申请实施例中语音识别的训练过程和实际语音识别的应用过程是类似的,目标语音提取模块的训练过程,可以使用不同的语音信号样本集进行交替训练。图4中包括了几种不同的信号样本集,分别为干净目标词语音样本集、受干扰目标词语音的正负样本集、受干扰命令语音样本集。本申请实施例给出了一个目标语音提取模块的整体的实现方案,具体为1)~5):
1)干净目标词语音样本集中至少包括干净目标词语音样本和对应的目标词标注信息,受干扰目标词语音的正负样本集中至少包括受干扰目标词语音的正负样本和对应的目标词标注信息,受干扰命令语音样本集中至少包括受干扰命令语音样本和对应的目标词标注信息。
其中,干净目标词语音样本的目标词标注信息的确定方式为:
针对干净目标词语音样本,去除低能量频谱窗口噪声以得到更准确的标注
Figure PCTCN2019111905-appb-000013
具体地:将干净目标词语音样本的输入频谱
Figure PCTCN2019111905-appb-000014
与一定阈值Γ比较,若确定某一时频窗口的频谱幅度与输入频谱的最高幅度的差值小于该阈值,则该时频窗口对应的目标词标注信息
Figure PCTCN2019111905-appb-000015
的取值为0,否则,
Figure PCTCN2019111905-appb-000016
的取值为1,即
Figure PCTCN2019111905-appb-000017
本申请实施例中,阈值Γ的取值为40dB,当然也可以根据实际情况和需求设置其它取值。
受干扰目标词语音的正负样本的目标词标注信息的确定方式为:
针对受干扰目标词语音的正负样本,通过比较其频谱幅度与其中的目标说话人的干净目标词语音的频谱幅度,来计算目标词标注。本申请实施例中提供了一种可能的实施方式,若确定受干扰目标词语音样本中目标说话人的干净目标词语音的频 谱幅度的占比大于预设比例阈值,则确定该受干扰目标词语音样本的目标词标注
Figure PCTCN2019111905-appb-000018
的取值为1,否则确定该受干扰目标词语音样本的目标词标注
Figure PCTCN2019111905-appb-000019
的取值为0。
例如,如果预设比例阈值为1/2,则若其中干净目标词语音的频谱幅度大于受干扰目标词语音样本的频谱幅度的1/2,则标注
Figure PCTCN2019111905-appb-000020
等于“1”,其表示对应的时频信号属于目标说话人,否则,标注
Figure PCTCN2019111905-appb-000021
等于“0”,其表示对应的时频信号属于干扰信号,即
Figure PCTCN2019111905-appb-000022
同样地,可以计算得到训练阶段的受干扰命令语音样本的目标词标注信息
Figure PCTCN2019111905-appb-000023
2)首先,针对图4中干净唤醒词语音样本,例如编号1对应的干净目标词语音样本的频谱
Figure PCTCN2019111905-appb-000024
经深度神经网络映射为K维嵌入空间的嵌入向量(embedding)
Figure PCTCN2019111905-appb-000025
其中,
Figure PCTCN2019111905-appb-000026
例如,该深度神经网络由4层双向LSTM层构成,每层LSTM有600个结点,K=40。各图中的虚线框表示各个LSTM网络共享同一套参数模型,可以设置相同的参数。
根据干净目标词语音样本的嵌入向量
Figure PCTCN2019111905-appb-000027
和对应的目标词标注信息
Figure PCTCN2019111905-appb-000028
计算干净目标词语音样本的锚提取特征,具体为:
Figure PCTCN2019111905-appb-000029
然后,对干净目标词语音样本集中所有说话人的干净目标词语音样本的锚提取特征A cw求平均,获得干净目标词语音样本集的平均锚提取特征
Figure PCTCN2019111905-appb-000030
3)首先,针对图4中编号2对应的受干扰目标词语音样本的频谱
Figure PCTCN2019111905-appb-000031
经深度神经网络,例如SLTM网络,映射为K维嵌入空间的嵌入向量(embedding)
Figure PCTCN2019111905-appb-000032
然后,将受干扰目标词语音样本的嵌入向量
Figure PCTCN2019111905-appb-000033
与上述干净目标词语音样本集的平均锚提取特征
Figure PCTCN2019111905-appb-000034
合并为2K维输入向量,经过前向网络1,预测其目标词标注信息
Figure PCTCN2019111905-appb-000035
并根据标注
Figure PCTCN2019111905-appb-000036
嵌入向量
Figure PCTCN2019111905-appb-000037
平均锚提取特征
Figure PCTCN2019111905-appb-000038
计算目标说话人即目标语音在原始嵌入空间的锚提取特征A nw,具体为:
Figure PCTCN2019111905-appb-000039
其中,α为调节参数,可以通过训练动态进行调整,从而可以动态调整目标语音的锚提取特征,以提高其准确性。
4)首先,针对图4中受干扰目标词语音样本的频谱
Figure PCTCN2019111905-appb-000040
或受干扰命令语音样本的频谱
Figure PCTCN2019111905-appb-000041
Figure PCTCN2019111905-appb-000042
为例进行说明,
Figure PCTCN2019111905-appb-000043
经深度神经网络LSTM映射为K维嵌入空间的嵌入向量(embedding)
Figure PCTCN2019111905-appb-000044
然后,将
Figure PCTCN2019111905-appb-000045
与上述2)和3)计算获得的干净目标词语音样本的锚提取特征A cw或受干扰目标词语音样本中目标语音的锚提取特征A nw分别进行后续训练。
本申请实施例中,图4中干净目标词语音信号流1和受干扰目标词语音信号流2交替训练,得到不同训练过程的目标语音的锚提取特征,完成在原始嵌入空间,即第一层嵌入空间中目标语音的锚提取特征的计算。输出的目标语音的锚提取特征再分别用于规整嵌入空间,即第二层嵌入空间中目标语音的规整锚提取特征的计算和目标语音的掩码计算和提取,具体地包括步骤(1)~(3):
步骤(1),根据受干扰命令语音样本的嵌入向量
Figure PCTCN2019111905-appb-000046
和目标语音的锚提取特征,计算对应的规整嵌入向量。
具体为:将嵌入向量和目标语音的锚提取特征进行合并,获得各合并后的2K维向量,并将各合并后的2K维向量输入到前向网络2中,基于前向网络2,将各合并后的2K维向量再次映射到固定维度的嵌入空间中,获得前向网络2输出的相应的K维向量,将输出的K维向量作为相应的嵌入向量的规整嵌入向量,即
Figure PCTCN2019111905-appb-000047
其中,前向网络2为两层的前向网络,每层的结点数是256,输入是2K维向量,输出是K维的规整嵌入向量
Figure PCTCN2019111905-appb-000048
为规整嵌入向量,f(□)表示通过深度神经网络学习到的非线性映射函数,其用于将原始嵌入空间映射到新的规整嵌入空间。
步骤(2),根据规整嵌入向量
Figure PCTCN2019111905-appb-000049
和受干扰命令语音样本中目标说话人标注信息,即目标词标注信息
Figure PCTCN2019111905-appb-000050
重新估计目标语音的规整锚提取特征,具体为:
Figure PCTCN2019111905-appb-000051
其中,
Figure PCTCN2019111905-appb-000052
为目标语音的规整锚提取特征。
步骤(3),根据目标语音的规整锚提取特征
Figure PCTCN2019111905-appb-000053
和规整嵌入向量
Figure PCTCN2019111905-appb-000054
计算得到目标语音的掩码(mask),具体为:
Figure PCTCN2019111905-appb-000055
其中,
Figure PCTCN2019111905-appb-000056
为目标语音的掩码,
Figure PCTCN2019111905-appb-000057
为规整嵌入向量与目标语音的规整锚提取特征的内积,表示各规整嵌入向量与目标语音的规整锚提取特征之间的距离,Sigmoid为S型函数,用于将计算出的内积值映射到[0,1]之间。
最后,根据目标语音的掩码,从受干扰目标词语音样本或受干扰命令语音样本中识别出目标语音,即获得的目标语音的掩码后的(masked)频谱为
Figure PCTCN2019111905-appb-000058
5)本申请实施例中,上述4)中是在规整嵌入空间重新估计目标语音的规整锚提取特征,并计算目标语音的掩码,从而估计的锚提取特征的分布更加稳定集中。同时,本申请实施例中也可以在原始嵌入空间计算目标语音的掩码,并可以在一定程度上识别出某一特定目标语音,具体为:根据上述3)计算获得的目标语音的锚提取特征A nw和嵌入向量V f,t,计算得到目标语音的掩码,即:
M f,t=Sigmoid(A nw×V f,t),其中,M f,t为目标语音的掩码。
则获得的目标语音为X f,t×M f,t
第二部分:目标词判断模块。
参阅图5所示,为本申请实施例中目标词判断模块的实现方案的结构框架图。本申请实施例中的目标词判断模块用于对重建获得的目标语音进行是否包括目标词的概率判断,该模块的输入为通过目标语音提取模块输出的掩码后的(masked)频谱特征
Figure PCTCN2019111905-appb-000059
输出为是否是目标词的判断概率。
具体为:根据目标词长度设置目标词的观察窗长度T,窗移T’;根据T,分别对输入的
Figure PCTCN2019111905-appb-000060
的各观察窗的频谱进行判断。
其中,T与目标词的长短有关,例如T取1.5s,T’取100ms。本申请实施例中,在训练时可以设置更短的T,以实现对目标语音频谱的逐帧判断。这样,可以通过长度较短的目标词语音,有效地跟踪并学习目标语音的特征,从而可以识别出受干扰语音中的目标语音,因此本申请实施例更适用于实际应用场景中目标词长度较短的情况。
如图5所示,可以将各观察窗的输入特征依次经过卷积网络(Convolutional Neural Network,CNN)、循环神经网络(Recurrent Neural Network,RNN)、全连接网络、softmax层,最后输出是否为目标词的预测概率。具体网络参数可以根据实际应用场景中对计算和内存资源的限制进行权衡调整,本申请实施例中给出如下一种可能的示例,包括1)~4):
1)一个CNN,其滤波器通道个数取值为32~256,卷积核大小在时间维度取值为5~40,在频谱维度取值为1~20,卷积步幅在时间维度取值为4~20,在频谱维度取值为1~10。
2)一个RNN,RNN的隐单元可以是LSTM单元或门控循环单元(Gated Recurrent Unit,GRU),隐单元个数为8~128。
3)一个全连接网络,结点个数可以为32~128。
4)softmax层,其输出是否为目标词的预测概率。
本申请实施例中的目标词判断模块不必全部使用上述的各个网络,也可以只使用其中某个网络进行训练。相比于相关技术,本申请实施例中给出的目标词判断模块的结构和性能更好,从而可以提高预测的准确性。
这样,本申请实施例中,通过目标语音提取模块和目标词判断模块可以同时优化目标词语音识别和目标语音的特征学习,并可以有效地学习到目标词对应的目标语音的锚提取特征,从而在实际测试和使用时,可以将学习到的目标语音的锚提取特征作为预设锚提取特征,而不需要再重新估计锚提取特征,从而可以对获得的语音信号进行逐帧实时处理,并重建得到高质量的目标语音。
基于上述实施例的图3、图4和图5,本申请实施例中可以根据不同的训练样本集,分别交替进行训练,因此,训练过程具体地还可以分为几个不同的训练阶段。 第一个训练阶段为:基于干净目标词语音的训练,第二个训练阶段为:基于受干扰目标词语音在原始嵌入空间的训练,第三个训练阶段为:基于受干扰目标词语音在规整嵌入空间的训练。下面分别进行介绍:
第一个训练阶段:参阅图6所示,为本申请实施例中基于干净目标词语音的训练方案的框架图,具体的各个参数的计算方法和上述图4对应的实施例相同。
输入为干净目标词语音样本、受干扰目标词语音的正负样本或受干扰命令语音样本;训练目标为同时优化目标语音重建任务和目标词判断任务,因此训练目标函数包括:最小化识别出的目标语音与干净目标语音之间的损失函数L 1,以及最小化检测目标词判断结果的交叉熵损失函数L 2,以降低目标词判断的错误率。
其中,损失函数L 1为重建的目标语音与干净目标语音的频谱之间的误差:
Figure PCTCN2019111905-appb-000061
目标词判断结果的交叉熵损失函数L 2(Cross Entropy,CE)函数,其中,计算该交叉熵损失函数时需要的目标词判断结果,即“是/否目标词”的标注可以通过使用一个高斯混合模型(Gaussian Mixed Model,GMM)/隐马尔可夫模型(Hidden Markov Model,HMM)的自动语音识别(Automatic Speech Recognition,ASR)系统对干净目标唤醒语音进行帧级别的对齐得到。
训练获得目标语音的方法和上述图4对应的实施例的描述相同,这里简单介绍如下:
首先,获取干净目标词语音样本,以及受干扰目标词语音的正负样本或受干扰命令语音样本。
然后,分别确定干净目标词语音样本的每个时频窗口对应的嵌入向量
Figure PCTCN2019111905-appb-000062
以及受干扰目标词语音的正负样本的每个时频窗口对应的嵌入向量
Figure PCTCN2019111905-appb-000063
受干扰命令语音样本的每个时频窗口对应的嵌入向量
Figure PCTCN2019111905-appb-000064
然后,根据干净目标词语音样本的目标词标注信息
Figure PCTCN2019111905-appb-000065
和各嵌入向量
Figure PCTCN2019111905-appb-000066
获得干净目标词语音样本的锚提取特征A cw
进一步地,本申请实施例中还可以对干净目标词语音样本集中所有说话人的干净目标词语音样本的锚提取特征A cw求平均,获得干净目标词语音样本集的平均锚提取特征
Figure PCTCN2019111905-appb-000067
然后,根据干净目标词语音样本的锚提取特征A cw和受干扰目标词语音样本的嵌入向量
Figure PCTCN2019111905-appb-000068
或者根据干净目标词语音样本的锚提取特征A cw和受干扰命令语音样本的嵌入向量
Figure PCTCN2019111905-appb-000069
获得受干扰目标词语音样本对应的规整嵌入向量,或受干扰命令语音样本对应的规整嵌入向量
Figure PCTCN2019111905-appb-000070
然后,根据受干扰目标词语音样本的标注信息
Figure PCTCN2019111905-appb-000071
或受干扰命令语音样本的标注信息
Figure PCTCN2019111905-appb-000072
以及规整嵌入向量
Figure PCTCN2019111905-appb-000073
获得目标语音的规整锚提取特征
Figure PCTCN2019111905-appb-000074
然后,根据目标语音的规整锚提取特征
Figure PCTCN2019111905-appb-000075
和规整嵌入向量
Figure PCTCN2019111905-appb-000076
获得目标语音的 掩码
Figure PCTCN2019111905-appb-000077
然后,根据目标语音的掩码,从受干扰目标词语音样本或受干扰命令语音样本中识别出目标语音,即masked频谱
Figure PCTCN2019111905-appb-000078
这样,即得到训练的第一个任务的结果,目标函数为最小化识别出的目标语音与干净目标语音之间的损失函数。
最后,将识别出的目标语音输入到目标词判断模块,判断是否有目标词,目标函数为目标词判断结果的交叉熵损失函数最小化。
第二个训练阶段:参阅图7所示,为本申请实施例中基于受干扰目标词语音在原始嵌入空间的训练方案的框架图,具体的各个参数的计算方法和上述图4对应的实施例相同。
输入为受干扰目标词语音的正负样本和/或受干扰命令语音样本;训练目标与上述第一个阶段基本相同,即包括:最小化识别出的目标语音与干净目标语音之间的损失函数L 1,以及最小化检测目标词判断结果的交叉熵损失函数L 2
第二阶段主要是用于优化原始嵌入空间中相关的网络参数,因此重建出的目标语音为在原始嵌入空间中得到的,即获得的目标语音信号为
Figure PCTCN2019111905-appb-000079
即第二阶段的
Figure PCTCN2019111905-appb-000080
计算L 2的目标词判断模块的输入即为
Figure PCTCN2019111905-appb-000081
其中,第二阶段中的平均锚提取特征为对训练样本集中所有说话人在第一个阶段得到的干净目标词语音样本的锚提取特征求平均来计算得到的。
具体地:首先,获取受干扰目标词语音的正负样本和/或受干扰命令语音样本,并分别确定受干扰目标词语音的正负样本的每个时频窗口对应的嵌入向量,以及受干扰命令语音样本的每个时频窗口对应的嵌入向量。
然后,根据干净目标词语音样本集的平均锚提取特征和受干扰目标词语音样本的嵌入向量,确定受干扰目标词语音样本的各嵌入向量对应的目标词标注信息。
然后,根据受干扰目标词语音样本的各嵌入向量、平均锚提取特征和对应的目标词标注信息,获得目标语音的锚提取特征。
然后,根据目标语音的锚提取特征和受干扰目标词语音样本的各嵌入向量,或根据目标语音的锚提取特征和受干扰命令语音样本的各嵌入向量,获得目标语音的掩码。
然后,根据目标语音的掩码,从受干扰目标词语音样本或受干扰命令语音样本中识别出目标语音。
最后,将识别出的目标语音输入到目标词判断模块,判断是否有目标词,目标函数为目标词判断结果的交叉熵损失函数最小化。
第三个训练阶段:参阅图8所示,为本申请实施例中基于受干扰目标词语音在规整嵌入空间的训练方案的框架图,具体的各个参数的计算方法和上述图4对应的 实施例相同。
该第三阶段训练的输入为受干扰目标词语音的正负样本和/或受干扰命令语音样本;训练目标与上述第一个阶段相同,即包括:最小化识别出的目标语音与干净目标语音之间的损失函数L 1,以及最小化检测目标词判断结果的交叉熵损失函数L 2
第三个训练阶段主要是用于优化规整嵌入空间相关的网络参数。其中,第三个训练阶段中的平均锚提取特征为对训练集中所有说话人在第一个阶段得到的干净目标词语音样本的锚提取特征求平均来计算得到的。
具体地:首先,获取受干扰目标词语音的正负样本和/或受干扰命令语音样本,并分别确定受干扰目标词语音的正负样本的每个时频窗口对应的嵌入向量,以及受干扰命令语音样本的每个时频窗口对应的嵌入向量。
然后,根据干净目标词语音样本集的平均锚提取特征和受干扰目标词语音样本的嵌入向量,确定受干扰目标词语音样本的各嵌入向量对应的目标词标注信息。
然后,根据受干扰目标词语音样本的各嵌入向量、平均锚提取特征和对应的目标词标注信息,获得目标语音的锚提取特征。
然后,根据目标语音的锚提取特征和受干扰目标词语音样本的各嵌入向量,或者根据目标语音的锚提取特征和受干扰命令语音样本的各嵌入向量,获得受干扰目标词语音样本对应的规整嵌入向量,或受干扰命令语音样本对应的规整嵌入向量。
然后,根据对应的目标词标注信息和各规整嵌入向量,获得目标语音的规整锚提取特征,并根据各规整嵌入向量和目标语音的规整锚提取特征,获得目标语音的掩码。
然后,根据目标语音的掩码,从受干扰目标词语音样本或受干扰命令语音样本中识别出目标语音。
最后,将识别出的目标语音输入到目标词判断模块,判断是否有目标词,目标函数为目标词判断结果的交叉熵损失函数最小化。
本申请实施例中上述三个阶段的训练可以依次、交替或迭代进行,各训练过程的实现示例中均可以采用自适应时刻估计方法(Adaptive Moment Estimation,ADAM)优化算法。
下面采用具体应用场景,对本申请实施例中的语音识别方法进行说明。上述三个阶段的训练完成后,需要对训练结果进行测试,具体地参阅图9所示,为本申请实施例中语音识别方法的测试方案的框架图。
测试过程和实际应用过程类似,即与上述图2对应的实施例类似。在测试过程中,受干扰语音,即输入的混合语音中,目标语音的标注是不可知的,包括
Figure PCTCN2019111905-appb-000082
Figure PCTCN2019111905-appb-000083
因此,本申请实施例中,采用训练集中所有说话人的干净目标词语音样本对应的锚提取特征的质心作为测试时的预设锚提取特征,即将第一个训练阶段训练得到的干净目标词语音样本集的平均锚提取特征
Figure PCTCN2019111905-appb-000084
作为测试过程中的预设锚提 取特征;并采用训练集中所有说话人的受干扰目标词语音样本的规整锚提取特征的质心作为测试时的预设规整锚提取特征,即将第三个训练阶段训练得到的受干扰目标词语音的正负样本集或受干扰命令语音样本集的目标语音的规整锚提取特征的平均值,作为测试过程中的预设规整锚提取特征。
具体为:首先,获取混合语音X f,t,并通过LSTM获得该混合语音在原始空间对应的嵌入向量V f,t
然后,根据预设锚提取特征
Figure PCTCN2019111905-appb-000085
和嵌入向量V f,t,经过前向网络1,预测得到嵌入向量对应的目标词标注信息
Figure PCTCN2019111905-appb-000086
并根据
Figure PCTCN2019111905-appb-000087
和V f,t,计算得到目标语音的锚提取特征A nw
然后,根据嵌入向量V f,t和目标语音的锚提取特征A nw,经前向网络2,计算得到嵌入向量对应的规整嵌入向量
Figure PCTCN2019111905-appb-000088
然后,根据规整嵌入向量
Figure PCTCN2019111905-appb-000089
和预设规整锚提取特征
Figure PCTCN2019111905-appb-000090
计算得到目标语音的掩码(mask)
Figure PCTCN2019111905-appb-000091
然后,根据目标语音的掩码
Figure PCTCN2019111905-appb-000092
从输入的混合语音中识别出目标语音,即掩码后的(masked)频谱
Figure PCTCN2019111905-appb-000093
即重建出目标说话人的目标语音。
最后,将
Figure PCTCN2019111905-appb-000094
输入到目标词判断模块,进行目标词判断预测;若包括目标词,则设备进入目标词对应的状态,例如唤醒状态;若不包括目标词,则设备仍处于未唤醒状态,并根据判断结果动态调整计算出的目标语音的锚提取特征A nw,以提高设备在唤醒状态中对输入的混合语音中目标语音的识别和跟踪的准确性。
具体地,参阅图10所示,为本申请实施例中语音识别方法的测试流程的示意图,以目标词为唤醒词为例进行说明,该方法包括:
步骤1000:输入混合语音。
步骤1001:将输入的混合语音,经过目标语音提取模块,识别出目标语音。
步骤1002:将目标语音提取模块输出的目标语音,输入到目标词判断模块。
步骤1003:判断是否包括目标词,若是,则执行步骤1004,否则,则执行步骤1005。
步骤1004:调整预设调节参数,使计算出的目标语音的锚提取特征中的预设锚提取特征的权重减小。
这时,如果判断包括目标词,则说明设备已进入唤醒状态,则在后续目标语音提取模块中,可以根据目标词语音跟踪对应的目标语音,不断调整目标语音的锚提取特征,并根据调整后的新的目标语音的锚提取特征,识别出后续混合命令语音中的目标命令语音,从而提高目标语音识别的准确性。
步骤1005:调整预设调节参数,使计算出的目标语音的锚提取特征中的预设锚提取特征的权重增加。
这时,如果判断出不包括目标词,则可能设备还未处于唤醒状态,未检测到目标词语音,则目标语音的锚提取特征可能比初始的预设锚提取特征更加准确,因此 在后续计算时,尽量使用该预设锚提取特征进行计算。
这样,本申请实施例中,测试时不需要重新估计锚提取特征,也不需要采用现有技术中的k-means聚类算法,因此,可以支持对输入的混合语音的逐帧实时处理,并且可以基于目标词,跟踪和识别对应的目标说话人的目标语音。
基于上述实施例,参阅图11所示,本申请实施例中的语音识别装置具体包括:
第一获得模块1100,用于从混合语音中识别出目标词语音,并基于所述目标词语音获得目标词语音的锚提取特征,将所述目标词语音的锚提取特征作为目标语音的锚提取特征;
第二获得模块1110,用于根据所述目标语音的锚提取特征,获得所述目标语音的掩码;
识别模块1120,用于根据所述目标语音的掩码,识别出所述目标语音。
本申请实施例中,第一获得模块1100具体用于:确定混合语音的每个时频窗口对应的嵌入向量;根据确定的各嵌入向量和预设锚提取特征,确定所述各嵌入向量分别对应的目标词标注信息;根据所述各嵌入向量、所述预设锚提取特征和所述对应的目标词标注信息,获得目标词语音的锚提取特征,将所述目标词语音的锚提取特征作为目标语音的锚提取特征。
本申请实施例中,所述第二获得模块1110具体用于:根据所述各嵌入向量和所述目标语音的锚提取特征,获得所述各嵌入向量对应的规整嵌入向量;根据各规整嵌入向量和预设规整锚提取特征,获得所述目标语音的掩码。
本申请实施例中,确定混合语音的每个时频窗口对应的嵌入向量时,第一获得模块1100具体用于:
对所述混合语音进行短时傅里叶变换,获得所述混合语音的频谱;
将所述混合语音的频谱映射到固定维度原始嵌入空间中,获得所述混合语音的每个时频窗口对应的嵌入向量。
可本申请实施例中,根据确定的各嵌入向量和预设锚提取特征,确定所述各嵌入向量分别对应的目标词标注信息时,第一获得模块1100具体用于:
分别将各嵌入向量和预设锚提取特征进行合并;
将各合并后的向量输入到预先训练的第一前向网络;
获得所述第一前向网络对各合并后的向量进行识别后输出的各嵌入向量对应的目标词标注信息,其中,不包括目标词语音的嵌入向量对应的目标词标注信息取值为0,包括目标词语音的嵌入向量对应的目标词标注信息取值为1。
本申请实施例中,根据所述各嵌入向量和所述目标语音的锚提取特征,获得所述各嵌入向量对应的规整嵌入向量时,所述第二获得模块1110具体用于:
分别将所述各嵌入向量和所述目标语音的锚提取特征进行合并,获得各合并后的2K维向量;其中,所述嵌入向量和所述目标语音的锚提取特征分别为K维向量;
将各合并后的2K维向量输入到预先训练的第二前向网络;
基于所述第二前向网络,将各合并后的2K维向量再次映射到固定维度的规整嵌入空间中,获得所述第二前向网络输出的相应的K维向量,并将输出的K维向量作为相应的嵌入向量的规整嵌入向量;其中,第二前向网络用于将原始嵌入空间映射到规整嵌入空间。
本申请实施例中,根据各规整嵌入向量和预设规整锚提取特征,获得所述目标语音的掩码时,第二获得模块1110具体用于:分别计算各规整嵌入向量和预设规整锚提取特征之间的距离,根据各距离的取值获得所述目标语音的掩码。
本申请实施例中,该语音识别装置进一步包括:
调整模块1130,用于将识别出的目标语音输入到预先训练的目标词判断模块,判断所述目标语音中是否包括目标词语音,若判断包括目标词语音,则调整预设调节参数,使计算出的目标语音的锚提取特征中所述预设锚提取特征的权重减小,若判断不包括目标词语音,则调整预设调节参数,使计算出的目标语音的锚提取特征中所述预设锚提取特征的权重增加;根据调整后的目标语音的锚提取特征,识别目标语音。
基于上述实施例,参阅图12所示,为本申请实施例中语音识别模型训练装置的结构示意图,其中,语音识别模型包括目标语音提取模块和目标词判断模块,该训练装置包括:
获取模块1200,用于获取语音样本集;其中,所述语音样本集为以下任意一种或组合:干净目标词语音样本集、受干扰目标词语音的正负样本集、受干扰命令语音样本集;
训练模块1210,用于训练目标语音提取模块,其中,所述目标语音提取模块的输入为所述语音样本集,输出为识别出的目标语音,所述目标语音提取模块的目标函数为识别出的目标语音与干净目标语音之间的损失函数最小化;并用于训练目标词判断模块,其中,所述目标词判断模块的输入为所述目标语音提取模块输出的目标语音,输出为目标词判断概率,所述目标词判断模块的目标函数为目标词判断结果的交叉熵损失函数最小化。
本申请实施例中,若所述语音样本集为:干净目标词语音样本集,以及受干扰目标词语音的正负样本集或受干扰命令语音样本集,其中,干净目标词语音样本集中至少包括干净目标词语音和对应的目标词标注信息,受干扰目标词语音的正负样本集中至少包括受干扰目标词语音和对应的目标词标注信息,受干扰命令语音样本集中至少包括受干扰命令语音和对应的目标词标注信息,则训练目标语音提取模块时,训练模块1210具体用于:
获取干净目标词语音样本,以及受干扰目标词语音的正负样本或受干扰命令语音样本,并分别确定所述干净目标词语音样本的每个时频窗口对应的嵌入向量、所述受干扰目标词语音的正负样本的每个时频窗口对应的嵌入向量,以及所述受干扰命令语音样本的每个时频窗口对应的嵌入向量;
根据所述干净目标词语音样本的目标词标注信息和各嵌入向量,获得所述干净目标词语音样本的锚提取特征,并根据所述干净目标词语音样本集中各干净目标词语音样本的锚提取特征,获得所述干净目标词语音样本集的平均锚提取特征;
根据所述干净目标词语音样本的锚提取特征和受干扰目标词语音样本的嵌入向量,或者根据所述干净目标词语音样本的锚提取特征和受干扰命令语音样本的嵌入向量,获得所述受干扰目标词语音样本对应的规整嵌入向量,或所述受干扰命令语音样本对应的规整嵌入向量;
根据对应的目标词标注信息和各规整嵌入向量,获得目标语音的规整锚提取特征,并根据各规整嵌入向量和所述目标语音的规整锚提取特征,获得目标语音的掩码;
根据所述目标语音的掩码,从所述受干扰目标词语音样本或所述受干扰命令语音样本中识别出所述目标语音。
本申请实施例中,若所述语音样本集为受干扰目标词语音的正负样本集和/或受干扰命令语音样本集,则训练目标语音提取模块时,训练模块1210具体用于:
获取受干扰目标词语音的正负样本和/或受干扰命令语音样本,并分别确定所述受干扰目标词语音的正负样本的每个时频窗口对应的嵌入向量,以及所述受干扰命令语音样本的每个时频窗口对应的嵌入向量;
根据所述干净目标词语音样本集的平均锚提取特征和受干扰目标词语音样本的嵌入向量,确定受干扰目标词语音样本的各嵌入向量对应的目标词标注信息;
根据所述受干扰目标词语音样本的各嵌入向量、所述平均锚提取特征和对应的目标词标注信息,获得目标语音的锚提取特征;
根据目标语音的锚提取特征和受干扰目标词语音样本的各嵌入向量,或根据目标语音的锚提取特征和受干扰命令语音样本的各嵌入向量,获得所述目标语音的掩码;
根据所述目标语音的掩码,从所述受干扰目标词语音样本或所述受干扰命令语音样本中识别出所述目标语音。
本申请实施例中,若所述语音样本集为受干扰目标词语音的正负样本集或受干扰命令语音样本集,则训练目标语音提取模块时,训练模块1210具体用于:
获取受干扰目标词语音的正负样本和/或受干扰命令语音样本,并分别确定所述受干扰目标词语音的正负样本的每个时频窗口对应的嵌入向量,以及所述受干扰命令语音样本的每个时频窗口对应的嵌入向量;
根据所述干净目标词语音样本集的平均锚提取特征和受干扰目标词语音样本的嵌入向量,确定受干扰目标词语音样本的各嵌入向量对应的目标词标注信息;
根据所述受干扰目标词语音样本的各嵌入向量、所述平均锚提取特征和对应的目标词标注信息,获得目标语音的锚提取特征;
根据所述目标语音的锚提取特征和受干扰目标词语音样本的各嵌入向量,或者 根据所述目标语音的锚提取特征和受干扰命令语音样本的各嵌入向量,获得所述受干扰目标词语音样本对应的规整嵌入向量,或所述受干扰命令语音样本对应的规整嵌入向量;
根据对应的目标词标注信息和各规整嵌入向量,获得目标语音的规整锚提取特征,并根据各规整嵌入向量和所述目标语音的规整锚提取特征,获得目标语音的掩码;
根据所述目标语音的掩码,从所述受干扰目标词语音样本或所述受干扰命令语音样本中识别出所述目标语音。
本申请实施例中,所述预设锚提取特征为通过预先训练获得的所述干净目标词语音样本集的平均锚提取特征;
所述预设规整锚提取特征为通过预先训练获得的受干扰目标词语音的正负样本集或受干扰命令语音样本集的目标语音的规整锚提取特征的平均值。
基于上述实施例,参阅图13所示,为本申请实施例中一种电子设备的结构示意图。
本申请实施例提供了一种电子设备,该电子设备可以包括处理器1310(Center Processing Unit,CPU)、存储器1320、输入设备1330和输出设备1340等,输入设备1330可以包括键盘、鼠标、触摸屏等,输出设备1340可以包括显示设备,如液晶显示器(Liquid Crystal Display,LCD)、阴极射线管(Cathode Ray Tube,CRT)等。该电子设备可以为终端(例如智能终端)或服务器等。
存储器1320可以包括只读存储器(ROM)和随机存取存储器(RAM),并向处理器1310提供存储器1320中存储的计算机可读程序指令和数据。在本申请实施例中,存储器1320可以用于存储本申请实施例中语音识别方法的程序指令。
处理器1310可以调用存储器1320存储的计算机可读程序指令,并按照获得的程序指令执行本申请实施例中任一种语音识别方法以及任一种语音识别模型训练方法。
为便于说明,本申请中的实施例以包括触摸屏的便携式多功能装置1400作示例性说明。本领域技术人员可以理解的,本申请中的实施例同样适用于其他装置,例如手持设备、车载设备、可穿戴设备、计算设备,以及各种形式的用户设备(User Equipment,UE),移动台(Mobile station,MS),终端(terminal),终端设备(Terminal Equipment)等等。
图14示出了根据一些实施例的包括触摸屏的便携式多功能装置1400的框图。所述装置1400可以包括输入单元1430、显示单元1440、重力加速度传感器1451、接近光传感器1452、环境光传感器1453、存储器1420、处理器1490、射频单元1410、音频电路1460、扬声器1461、麦克风1462、WiFi(wireless fidelity,无线保真)模块1470、蓝牙模块1480、电源1493、外部接口1497等部件。
本领域技术人员可以理解,图14仅仅是便携式多功能装置的举例,并不构成 对便携式多功能装置的限定,该装置可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件。
所述输入单元1430可用于接收输入的数字或字符信息,以及产生与所述便携式多功能装置的用户设置以及功能控制有关的键信号输入。具体地,输入单元1430可包括触摸屏1431以及其他输入设备1432。所述触摸屏1431可收集用户在其上或附近的触摸操作(比如用户使用手指、关节、触笔等任何适合的物体在触摸屏上或在触摸屏附近的操作),并根据预先设定的程序驱动相应的连接装置。触摸屏可以检测用户对触摸屏的触摸动作,将所述触摸动作转换为触摸信号发送给所述处理器1490,并能接收所述处理器1490发来的命令并加以执行;所述触摸信号至少包括触点坐标信息。所述触摸屏1431可以提供所述装置1400和用户之间的输入界面和输出界面。此外,可以采用电阻式、电容式、红外线以及表面声波等多种类型实现触摸屏。除了触摸屏1431,输入单元1430还可以包括其他输入设备。具体地,其他输入设备1432可以包括但不限于物理键盘、功能键(比如音量控制按键、开关按键等)、轨迹球、鼠标、操作杆等中的一种或多种。
所述显示单元1440可用于显示由用户输入的信息或提供给用户的信息以及装置1400的各种菜单。进一步的,触摸屏1431可覆盖显示面板,当触摸屏1431检测到在其上或附近的触摸操作后,传送给处理器1490以确定触摸事件的类型,随后处理器1490根据触摸事件的类型在显示面板上提供相应的视觉输出。在本实施例中,触摸屏与显示单元可以集成为一个部件而实现装置1400的输入、输出、显示功能;为便于描述,本申请实施例以触摸屏代表触摸屏和显示单元的功能集合;在某些实施例中,触摸屏与显示单元也可以作为两个独立的部件。
所述重力加速度传感器1451可检测各个方向上(一般为三轴)加速度的大小,同时,所述重力加速度传感器1451还可用于检测终端静止时重力的大小及方向,可用于识别手机姿态的应用(比如横竖屏切换、相关游戏、磁力计姿态校准)、振动识别相关功能(比如计步器、敲击)等。
装置1400还可以包括一个或多个接近光传感器1452,用于当所述装置1400距用户较近时(例如当用户正在打电话时靠近耳朵)关闭并禁用触摸屏以避免用户对触摸屏的误操作;装置1400还可以包括一个或多个环境光传感器1453,用于当装置1400位于用户口袋里或其他黑暗区域时保持触摸屏关闭,以防止装置1400在锁定状态时消耗不必要的电池功耗或被误操作,在一些实施例中,接近光传感器和环境光传感器可以集成在一颗部件中,也可以作为两个独立的部件。至于装置1400还可配置陀螺仪、气压计、湿度计、温度计、红外线传感器等其他传感器,在此不再赘述。虽然图14示出了接近光传感器和环境光传感器,但是可以理解的是,其并不属于装置1400的必须构成,完全可以根据需要在不改变申请的本质的范围内而省略。
所述存储器1420可用于存储指令和数据,存储器1420可主要包括存储指令区 和存储数据区,存储数据区可存储关节触摸手势与应用程序功能的关联关系;存储指令区可存储操作系统、至少一个功能所需的指令等;所述指令可使处理器1490执行本申请实施例中的语音识别方法。
处理器1490是装置1400的控制中心,利用各种接口和线路连接整个手机的各个部分,通过运行或执行存储在存储器1420内的指令以及调用存储在存储器1420内的数据,执行装置1400的各种功能和处理数据,从而对手机进行整体监控。可选的,处理器1490可包括一个或多个处理单元;优选的,处理器1490可集成应用处理器和调制解调处理器,其中,应用处理器主要处理操作系统、用户界面和应用程序等,调制解调处理器主要处理无线通信。可以理解的是,上述调制解调处理器也可以不集成到处理器1490中。在一些实施例中,处理器、存储器、可以在单一芯片上实现,在一些实施例中,他们也可以在独立的芯片上分别实现。在本申请实施例中,处理器1490还用于调用存储器中的指令以实现本申请实施例中的语音识别方法。
所述射频单元1410可用于收发信息或通话过程中信号的接收和发送,特别地,将基站的下行信息接收后,给处理器1490处理;另外,将涉及上行的数据发送给基站。通常,RF电路包括但不限于天线、至少一个放大器、收发信机、耦合器、低噪声放大器(Low Noise Amplifier,LNA)、双工器等。此外,射频单元1410还可以通过无线通信与网络设备和其他设备通信。所述无线通信可以使用任一通信标准或协议,包括但不限于全球移动通讯系统(Global System for Mobile Communications,GSM)、通用分组无线服务(General Packet Radio Service,GPRS)、码分多址(Code Division Multiple Access,CDMA)、宽带码分多址(Wideband Code Division Multiple Access,WCDMA)、长期演进(Long Term Evolution,LTE)、电子邮件、短消息服务(Short Messaging Service,SMS)等。
音频电路1460、扬声器1461、麦克风1462可提供用户与装置1400之间的音频接口。一方面,音频电路1460可将接收到的音频数据转换为电信号后传输到扬声器1461,由扬声器1461转换为声音信号输出;另一方面,麦克风1462将收集的声音信号转换为电信号,由音频电路1460接收后转换为音频数据,再将音频数据输出至处理器1490处理后,经射频单元1410发送给例如另一终端,或者将音频数据输出至存储器1420以便进一步处理。音频电路还可以包括耳机插孔1463,用于提供音频电路和耳机之间的连接接口。
WiFi属于短距离无线传输技术,装置1400通过WiFi模块1470可以帮助用户收发电子邮件、浏览网页和访问流式媒体等,它为用户提供了无线的宽带互联网访问。虽然图14示出了WiFi模块1470,但是可以理解的是,其并不属于装置1400的必须构成,完全可以根据需要在不改变申请的本质的范围内而省略。
蓝牙是一种短距离无线通讯技术。利用蓝牙技术,能够有效地简化掌上电脑、笔记本电脑和手机等移动通信终端设备之间的通信,也能够成功地简化以上这些设备与因特网(Internet)之间的通信。装置1400通过蓝牙模块1480使其与因特网之间的数据传输变得更加迅速高效,为无线通信拓宽道路。蓝牙技术是能够实现语音和数据无线传输的开放性方案。虽然图14示出了蓝牙模块1480,但是可以理解的是,其并不属于装置1400的必须构成,完全可以根据需要在不改变申请的本质的范围内而省略。
装置1400还包括给各个部件供电的电源1493(比如电池),优选的,电源可以通过电源管理系统1494与处理器1490逻辑相连,从而通过电源管理系统1494实现充电、放电以及功耗管理等功能。
装置1400还包括外部接口1497,所述外部接口可以是标准的Micro USB接口,也可以是多针连接器,可以用于连接装置1400与其他装置进行通信,也可以用于连接充电器为装置1400充电。
尽管未示出,装置1400还可以包括摄像头、闪光灯等,在此不再赘述。
基于上述实施例,本申请实施例中,提供了一种计算机可读存储介质,其上存储有计算机可读程序指令,所述计算机可读程序指令被处理器执行时实现上述任意方法实施例中的语音识别方法以及语音识别模型训练方法。
本领域内的技术人员应明白,本申请的实施例可提供为方法、系统、或计算机程序产品。因此,本申请可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本申请可采用在一个或多个其中包含有计算机可用程序代码的易失性或非易失性计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。
本申请是参照根据本申请实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。
尽管已描述了本申请的优选实施例,但本领域内的技术人员一旦得知了基本创造性概念,则可对这些实施例作出另外的变更和修改。所以,所附权利要求意欲解释为包括优选实施例以及落入本申请范围的所有变更和修改。
显然,本领域的技术人员可以对本申请实施例进行各种改动和变型而不脱离本申请实施例的精神和范围。这样,倘若本申请实施例的这些修改和变型属于本申请权利要求及其等同技术的范围之内,则本申请也意图包含这些改动和变型在内。

Claims (16)

  1. 一种语音识别方法,由电子设备执行,包括:
    从混合语音中识别出目标词语音,并基于所述目标词语音获得目标词语音的锚提取特征,将所述目标词语音的锚提取特征作为目标语音的锚提取特征;
    根据所述目标语音的锚提取特征,获得所述目标语音的掩码;
    根据所述目标语音的掩码,识别出所述目标语音。
  2. 如权利要求1所述的方法,其中,所述从混合语音中识别出目标词语音,并基于所述目标词语音获得目标词语音的锚提取特征,将所述目标词语音的锚提取特征作为目标语音的锚提取特征,具体包括:
    确定所述混合语音的每个时频窗口对应的嵌入向量;
    根据确定的各嵌入向量和预设锚提取特征,确定所述各嵌入向量分别对应的目标词标注信息;
    根据所述各嵌入向量、所述预设锚提取特征和所述对应的目标词标注信息,获得目标词语音的锚提取特征,将所述目标词语音的锚提取特征作为目标语音的锚提取特征。
  3. 如权利要求2所述的方法,其中,所述根据所述目标语音的锚提取特征,获得所述目标语音的掩码,具体包括:
    根据所述各嵌入向量和所述目标语音的锚提取特征,获得所述各嵌入向量对应的规整嵌入向量;
    根据各规整嵌入向量和预设规整锚提取特征,获得所述目标语音的掩码。
  4. 如权利要求2所述的方法,其中,所述确定所述混合语音的每个时频窗口对应的嵌入向量,具体包括:
    对所述混合语音进行短时傅里叶变换,获得所述混合语音的频谱;
    将所述混合语音的频谱映射到固定维度的原始嵌入空间中,获得所述混合语音的每个时频窗口对应的嵌入向量。
  5. 如权利要求2所述的方法,其中,所述根据确定的各嵌入向量和预设锚提取特征,确定所述各嵌入向量分别对应的目标词标注信息,具体包括:
    分别将各嵌入向量和预设锚提取特征进行合并;
    将各合并后的向量输入到预先训练的第一前向网络;
    获得所述第一前向网络对各合并后的向量进行识别后输出的各嵌入向量对应的目标词标注信息,其中,不包括目标词语音的嵌入向量对应的目标词标注信息取值为0,包括目标词语音的嵌入向量对应的目标词标注信息取值为1。
  6. 如权利要求3所述的方法,其中,所述根据所述各嵌入向量和所述目标语音的锚提取特征,获得所述各嵌入向量对应的规整嵌入向量,具体包括:
    分别将所述各嵌入向量和所述目标语音的锚提取特征进行合并,获得各合并后的2K维向量;其中,所述嵌入向量和所述目标语音的锚提取特征分别为K维向量;
    将各合并后的2K维向量输入到预先训练的第二前向网络;
    基于所述第二前向网络,将各合并后的2K维向量再次映射到固定维度的规整嵌入空间中,获得所述第二前向网络输出的相应的K维向量,并将输出的K维向量作为相应的嵌入向量的规整嵌入向量;其中,第二前向网络用于将原始嵌入空间映射到规整嵌入空间。
  7. 如权利要求3所述的方法,其中,所述根据各规整嵌入向量和预设规整锚提取特征,获得所述目标语音的掩码,具体包括:
    分别计算各规整嵌入向量和预设规整锚提取特征之间的距离,根据各距离的取值获得所述目标语音的掩码。
  8. 如权利要求1所述的方法,其中,进一步包括:
    将识别出的目标语音输入到预先训练的目标词判断模块,判断所述目标语音中是否包括目标词语音,若判断包括目标词语音,则调整预设调节参数,使计算出的目标语音的锚提取特征中所述预设锚提取特征的权重减小,若判断不包括目标词语音,则调整预设调节参数,使计算出的目标语音的锚提取特征中所述预设锚提取特征的权重增加;
    根据调整后的目标语音的锚提取特征,识别目标语音。
  9. 一种语音识别模型训练方法,由电子设备执行,其中,所述语音识别模型包括目标语音提取模块和目标词判断模块,该方法包括:
    获取语音样本集;其中,所述语音样本集为以下任意一种或组合:干净目标词语音样本集、受干扰目标词语音的正负样本集、受干扰命令语音样本集;
    训练目标语音提取模块,其中,所述目标语音提取模块的输入为所述语音样本集,输出为识别出的目标语音,所述目标语音提取模块的目标函数为识别出的目标语音与干净目标语音之间的损失函数最小化;
    训练目标词判断模块,其中,所述目标词判断模块的输入为所述目标语音提取模块输出的目标语音,输出为目标词判断概率,所述目标词判断模块的目标函数为目标词判断结果的交叉熵损失函数最小化。
  10. 一种语音识别装置,包括:
    第一获得模块,用于从混合语音中识别出目标词语音,并基于所述目标词语音获得目标词语音的锚提取特征,将所述目标词语音的锚提取特征作为目标语音的锚提取特征;
    第二获得模块,用于根据所述目标语音的锚提取特征,获得所述目标语音的掩码;
    识别模块,用于根据所述目标语音的掩码,识别出所述目标语音。
  11. 如权利要求10所述的装置,其中,第一获得模块具体用于:
    确定所述混合语音的每个时频窗口对应的嵌入向量;
    根据确定的各嵌入向量和预设锚提取特征,确定所述各嵌入向量分别对应的目标词标注信息;
    根据所述各嵌入向量、所述预设锚提取特征和所述对应的目标词标注信息,获得目标词语音的锚提取特征,将所述目标词语音的锚提取特征作为目标语音的锚提取特征。
  12. 如权利要求11所述的装置,其中,所述第二获得模块具体用于:
    根据所述各嵌入向量和所述目标语音的锚提取特征,获得所述各嵌入向量对应的规整嵌入向量;
    根据各规整嵌入向量和预设规整锚提取特征,获得所述目标语音的掩码。
  13. 如权利要求10所述的装置,其中,进一步包括:
    调整模块,用于将识别出的目标语音输入到预先训练的目标词判断模块,判断所述目标语音中是否包括目标词语音,若判断包括目标词语音,则调整预设调节参数,使计算出的目标语音的锚提取特征中所述预设锚提取特征的权重减小,若判断不包括目标词语音,则调整预设调节参数,使计算出的目标语音的锚提取特征中所述预设锚提取特征的权重增加;根据调整后的目标语音的锚提取特征,识别目标语音。
  14. 一种语音识别模型训练装置,其中,所述语音识别模型包括目标语音提取模块和目标词判断模块,该装置包括:
    获取模块,用于获取语音样本集;其中,所述语音样本集为以下任意一种或组合:干净目标词语音样本集、受干扰目标词语音的正负样本集、受干扰命令语音样本集;
    训练模块,用于训练目标语音提取模块,其中,所述目标语音提取模块的输入为所述语音样本集,输出为识别出的目标语音,所述目标语音提取模块的目标函数为识别出的目标语音与干净目标语音之间的损失函数最小化;并训练目标词判断模块,其中,所述目标词判断模块的输入为所述目标语音提取模块输出的目标语音,输出为目标词判断概率,所述目标词判断模块的目标函数为目标词判断结果的交叉熵损失函数最小化。
  15. 一种电子设备,包括:
    至少一个存储器,用于存储计算机可读程序指令;
    至少一个处理器,用于调用所述存储器中存储的计算机可读程序指令,按照获得的计算机可读程序指令执行如权利要求1-8任一项所述的语音识别方法或者如权利要求9所述的语音识别模型训练方法。
  16. 一种计算机可读存储介质,其中,所述计算机可读存储介质存储有计算机可读程序指令,所述计算机可读程序指令被处理器加载并执行如权利要求1-8任一项所述的语音识别方法或者如权利要求9所述的语音识别模型训练方法。

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP19874914.5A EP3767619A4 (en) 2018-10-25 2019-10-18 PROCESS AND APPARATUS FOR SPEECH RECOGNITION AND SPEECH RECOGNITION MODEL LEARNING
US17/077,141 US11798531B2 (en) 2018-10-25 2020-10-22 Speech recognition method and apparatus, and method and apparatus for training speech recognition model

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811251081.7A CN110176226B (zh) 2018-10-25 2018-10-25 一种语音识别、及语音识别模型训练方法及装置
CN201811251081.7 2018-10-25

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/077,141 Continuation US11798531B2 (en) 2018-10-25 2020-10-22 Speech recognition method and apparatus, and method and apparatus for training speech recognition model

Publications (1)

Publication Number Publication Date
WO2020083110A1 true WO2020083110A1 (zh) 2020-04-30

Family

ID=67689088

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/111905 WO2020083110A1 (zh) 2018-10-25 2019-10-18 一种语音识别、及语音识别模型训练方法及装置

Country Status (4)

Country Link
US (1) US11798531B2 (zh)
EP (1) EP3767619A4 (zh)
CN (5) CN110176226B (zh)
WO (1) WO2020083110A1 (zh)

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8078463B2 (en) * 2004-11-23 2011-12-13 Nice Systems, Ltd. Method and apparatus for speaker spotting
JP5180928B2 (ja) * 2008-08-20 2013-04-10 本田技研工業株式会社 音声認識装置及び音声認識装置のマスク生成方法
JP5738020B2 (ja) * 2010-03-11 2015-06-17 本田技研工業株式会社 音声認識装置及び音声認識方法
US9280982B1 (en) * 2011-03-29 2016-03-08 Google Technology Holdings LLC Nonstationary noise estimator (NNSE)
EP2877992A1 (en) * 2012-07-24 2015-06-03 Nuance Communications, Inc. Feature normalization inputs to front end processing for automatic speech recognition
US9390712B2 (en) * 2014-03-24 2016-07-12 Microsoft Technology Licensing, Llc. Mixed speech recognition
CN105895080A (zh) * 2016-03-30 2016-08-24 乐视控股(北京)有限公司 语音识别模型训练方法、说话人类型识别方法及装置
CA3037090A1 (en) * 2016-10-24 2018-05-03 Semantic Machines, Inc. Sequence to sequence transformations for speech synthesis via recurrent neural networks
US10460727B2 (en) * 2017-03-03 2019-10-29 Microsoft Technology Licensing, Llc Multi-talker speech recognizer
CN107908660B (zh) * 2017-10-17 2021-07-09 东华大学 面向数据开放共享的数据划分与组织方法
CN108198569B (zh) * 2017-12-28 2021-07-16 北京搜狗科技发展有限公司 一种音频处理方法、装置、设备及可读存储介质
US10811000B2 (en) * 2018-04-13 2020-10-20 Mitsubishi Electric Research Laboratories, Inc. Methods and systems for recognizing simultaneous speech by multiple speakers
US11734328B2 (en) * 2018-08-31 2023-08-22 Accenture Global Solutions Limited Artificial intelligence based corpus enrichment for knowledge population and query response

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103152244A (zh) * 2013-01-30 2013-06-12 歌尔声学股份有限公司 一种控制即时通信平台通信的方法、装置和通信系统
CN103325381A (zh) * 2013-05-29 2013-09-25 吉林大学 一种基于模糊隶属函数的语音分离方法
US20170162194A1 (en) * 2015-12-04 2017-06-08 Conexant Systems, Inc. Semi-supervised system for multichannel source enhancement through configurable adaptive transformations and deep neural network
CN106448680A (zh) * 2016-03-01 2017-02-22 常熟苏大低碳应用技术研究院有限公司 一种采用感知听觉场景分析的缺失数据特征说话人识别方法
CN107808660A (zh) * 2016-09-05 2018-03-16 株式会社东芝 训练神经网络语言模型的方法和装置及语音识别方法和装置
CN106782565A (zh) * 2016-11-29 2017-05-31 重庆重智机器人研究院有限公司 一种声纹特征识别方法及系统
CN106920544A (zh) * 2017-03-17 2017-07-04 深圳市唯特视科技有限公司 一种基于深度神经网络特征训练的语音识别方法
CN107195295A (zh) * 2017-05-04 2017-09-22 百度在线网络技术(北京)有限公司 基于中英文混合词典的语音识别方法及装置
CN108615535A (zh) * 2018-05-07 2018-10-02 腾讯科技(深圳)有限公司 语音增强方法、装置、智能语音设备和计算机设备
CN110176226A (zh) * 2018-10-25 2019-08-27 腾讯科技(深圳)有限公司 一种语音识别、及语音识别模型训练方法及装置

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113889088A (zh) * 2021-09-28 2022-01-04 北京百度网讯科技有限公司 训练语音识别模型的方法及装置、电子设备和存储介质
CN113889088B (zh) * 2021-09-28 2022-07-15 北京百度网讯科技有限公司 训练语音识别模型的方法及装置、电子设备和存储介质
CN113963715A (zh) * 2021-11-09 2022-01-21 清华大学 语音信号的分离方法、装置、电子设备及存储介质

Also Published As

Publication number Publication date
CN110428808A (zh) 2019-11-08
CN110288979A (zh) 2019-09-27
US20210043190A1 (en) 2021-02-11
CN110176226B (zh) 2024-02-02
CN110176226A (zh) 2019-08-27
CN110364144B (zh) 2022-09-02
CN110288979B (zh) 2022-07-05
EP3767619A4 (en) 2021-09-08
CN110428808B (zh) 2022-08-19
US11798531B2 (en) 2023-10-24
EP3767619A1 (en) 2021-01-20
CN110288978A (zh) 2019-09-27
CN110288978B (zh) 2022-08-30
CN110364144A (zh) 2019-10-22

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19874914

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2019874914

Country of ref document: EP

Effective date: 20201015

NENP Non-entry into the national phase

Ref country code: DE