
WO2021093449A1 - Artificial intelligence-based wake-up word detection method, apparatus, device, and medium - Google Patents

Artificial intelligence-based wake-up word detection method, apparatus, device, and medium

Info

Publication number
WO2021093449A1
WO2021093449A1 (PCT/CN2020/115800)
Authority
WO
WIPO (PCT)
Prior art keywords
wake
syllable
probability vector
confidence
text
Prior art date
Application number
PCT/CN2020/115800
Other languages
English (en)
French (fr)
Inventor
陈杰
苏丹
金明杰
朱振岭
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 filed Critical 腾讯科技(深圳)有限公司
Publication of WO2021093449A1 publication Critical patent/WO2021093449A1/zh
Priority to US17/483,617 priority Critical patent/US11848008B2/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/187 Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • G10L15/19 Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L15/197 Probabilistic grammars, e.g. word n-grams
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/027 Syllables being the recognition units
    • G10L2015/088 Word spotting
    • G10L2015/223 Execution procedure of a spoken command
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application generally relates to the field of speech recognition technology, and particularly relates to artificial intelligence-based wake word detection methods, devices, equipment, and media.
  • Key speech technologies include automatic speech recognition (ASR), speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the future direction of human-computer interaction, and voice has become one of the most promising human-computer interaction methods.
  • Voice wake-up, also known as keyword spotting (KWS), is usually implemented by setting a fixed wake-up word: only after the user speaks the wake-up word does the speech recognition function on the terminal enter the working state; otherwise it remains in the dormant state.
  • For example, the recognition result is output through an acoustic model built on a deep neural network; this acoustic model is trained on the syllables or phonemes of the fixed wake-up word and therefore does not support modifying the wake-up word.
  • The prior art also offers custom wake-up solutions, such as a custom wake-up word scheme based on a hidden Markov model (HMM), which consists of two parts: an acoustic model and an HMM decoding network.
  • During wake-up word detection, speech is input into the decoding network with a fixed window size, and the Viterbi decoding algorithm is then used to find the optimal decoding path.
  • The embodiment of the present application provides an artificial intelligence-based wake-up word detection method, which includes:
  • using a preset pronunciation dictionary to construct at least one syllable combination sequence for a user-defined wake-up word text input through a user interface, where the pronunciation dictionary contains the pronunciations corresponding to multiple text elements, and the syllable combination sequence is an ordered combination of the multiple syllables corresponding to the multiple text elements of the wake-up word text;
  • acquiring voice data to be recognized, and extracting the voice feature of each voice frame in the voice data to be recognized;
  • inputting the voice features into a pre-built deep neural network model and outputting, for each voice feature, a posterior probability vector over syllable identifiers, where the deep neural network model includes the same number of syllable output units as there are syllables in the pronunciation dictionary;
  • determining a target probability vector from the posterior probability vector according to the syllable combination sequence, where the target probability vector includes a posterior probability value, determined according to the posterior probability vector, corresponding to each text element in the wake-up word text;
  • calculating a confidence according to the target probability vector, and determining that the speech frame contains the wake-up word text when the confidence is greater than or equal to a threshold.
  • the embodiment of the application provides an artificial intelligence-based wake word detection device, which includes:
  • the wake-up word setting unit is configured to use a preset pronunciation dictionary to construct at least one syllable combination sequence for the user-defined wake-up word text input by the user through the user interface, where the pronunciation dictionary contains pronunciations corresponding to multiple text elements, and the syllable combination sequence is an ordered combination of the multiple syllables corresponding to the multiple text elements of the wake-up word text;
  • the voice feature extraction unit is used to obtain the voice data to be recognized, and extract the voice feature of each voice frame in the voice data to be recognized;
  • the voice feature recognition unit is configured to input the voice features into a pre-built deep neural network model and output, for each voice feature, a posterior probability vector over syllable identifiers, where the deep neural network model includes the same number of syllable output units as there are syllables in the pre-built pronunciation dictionary;
  • the confidence determination unit is configured to determine a target probability vector from the posterior probability vector according to the syllable combination sequence, where the target probability vector includes a posterior probability value, determined according to the posterior probability vector, corresponding to each text element in the wake-up word text; the confidence is then calculated according to the target probability vector, and when the confidence is greater than or equal to a threshold, it is determined that the speech frame contains the wake-up word text.
  • The embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and capable of running on the processor, where the processor implements the method described in the embodiments of the present application when executing the program.
  • the embodiment of the present application provides a wake word detection system based on artificial intelligence, including:
  • the first terminal device is configured to use a preset pronunciation dictionary to construct at least one syllable combination sequence for the user-defined wake-up word text input through the user interface, and to provide the at least one syllable combination sequence to the second terminal device, where the pronunciation dictionary includes pronunciations corresponding to multiple text elements, and the syllable combination sequence is an ordered combination of multiple syllables corresponding to the multiple text elements of the wake-up word text;
  • the second terminal device is configured to obtain the voice data to be recognized, and extract the voice feature of each voice frame in the voice data to be recognized;
  • the second terminal device is further configured to input the voice features into a pre-built deep neural network model to obtain posterior probability vectors, and to determine a target probability vector from the posterior probability vector according to the syllable combination sequence, where the target probability vector includes a posterior probability value, determined according to the posterior probability vector, corresponding to each text element in the wake-up word text;
  • the confidence is calculated according to the target probability vector, and when the confidence is greater than or equal to a threshold, it is determined that the speech frame contains the wake word text.
  • The embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when executed, the computer program implements the artificial intelligence-based wake-up word detection method described above.
  • The artificial intelligence-based wake-up word detection method, apparatus, device, and medium recognize speech data by constructing a deep neural network model covering all syllables of the pronunciation dictionary; according to the pre-input wake-up word text, the posterior probability values corresponding to the syllable identifiers of the wake-up word text are extracted from the recognition result as the target probability vector; a confidence is then calculated according to the target probability vector and judged against a threshold to determine whether the voice data contains the content corresponding to the wake-up word text.
  • the method provided by the embodiment of the present application has low computational complexity and fast response speed, and does not require special optimization and improvement for fixed wake-up words, which effectively improves the efficiency of wake-up detection.
  • Figure 1 shows a schematic diagram of a wake-up word application scenario provided by an embodiment of the present application
  • FIG. 2 shows a schematic flowchart of a wake word detection method provided by an embodiment of the present application
  • FIG. 3 shows a schematic flowchart of step 105 provided by an embodiment of the present application
  • FIG. 4 shows a schematic flowchart of step 105 provided by another embodiment of the present application.
  • FIG. 5 shows a schematic diagram of a wake word text input interface provided by an embodiment of the present application
  • Fig. 6 shows a schematic diagram of a syllable combination sequence provided by an embodiment of the present application
  • FIG. 7 shows a schematic structural diagram of a wake word detection device 700 provided by an embodiment of the present application.
  • FIG. 8 shows a schematic structural diagram of a confidence determination unit 703 provided by an embodiment of the present application.
  • FIG. 9 shows a schematic structural diagram of a confidence determination unit 703 according to another embodiment of the present application.
  • FIG. 10 shows a wake word detection system 1000 provided by an embodiment of the present application.
  • Fig. 11 shows a schematic structural diagram of a terminal device or a server suitable for implementing the embodiments of the present application.
  • Fig. 1 shows a schematic diagram of an application scenario of a wake-up word provided by an embodiment of the present application.
  • The terminal device automatically detects the voice data around it and recognizes whether the voice data contains a wake-up word.
  • When a wake-up word is detected, the terminal device can switch from an incomplete working state (such as a sleep state) to the working state; that is, the terminal device is awakened so that it can work normally.
  • the terminal device may be a mobile phone, a tablet, a notebook, a wireless speaker, a smart robot, a smart home appliance, etc. It can be a fixed terminal or a mobile terminal.
  • the terminal device may at least include a voice receiving device.
  • the voice receiving device can receive the voice data output by the user, and process the voice data to obtain data that can be identified and analyzed.
  • The terminal equipment may also include other devices, such as a processing device used to perform intelligent recognition processing on the voice data.
  • When the wake-up word is detected, the terminal device is woken up.
  • the voice wake-up technology can also be used to perform a wake-up operation on the application program installed in the terminal device in advance, so as to realize the convenient and quick start of the application program and reduce the operating procedures of the terminal operating system.
  • The deep neural network is trained and constructed according to fixed wake-up words.
  • the terminal device can extract the voice feature of the voice data, input it into the deep neural network model, and then output the posterior probability of the tag category of the voice feature corresponding to the fixed wake-up word.
  • the confidence is calculated according to the posterior probability, and the confidence is used to determine whether the voice data contains a fixed wake word.
  • To change the wake-up word, the deep neural network needs to be retrained, so the user cannot change the wake-up word arbitrarily.
  • this application proposes a new wake word detection method based on artificial intelligence.
  • Artificial intelligence (AI) is the theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technology of computer science, which attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive discipline, covering a wide range of fields, including both hardware-level technology and software-level technology.
  • Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • FIG. 2 shows a schematic flowchart of a wake word detection method provided by an embodiment of the present application.
  • The method can be executed by a processor of one or more computing devices.
  • The one or more computing devices may be one or more terminal devices and/or one or more servers. As shown in Figure 2, the method may include the following steps.
  • a preset pronunciation dictionary can be used to construct at least one syllable combination sequence for the user-defined wake-up word text input by the user through the user interface.
  • the pronunciation dictionary contains pronunciations corresponding to multiple text elements respectively.
  • the syllable combination sequence is an ordered combination of multiple syllables corresponding to multiple text elements of the wake word text.
  • Step 101 Acquire voice data to be recognized, and extract the voice feature of each voice frame in the voice data to be recognized.
  • the voice data around the terminal device can be monitored in real time or regularly, or the voice data can be obtained after receiving the voice wake-up trigger instruction.
  • The voice features of the voice data are extracted; for example, the voice feature is extracted from each voice frame of the voice data according to the FilterBank algorithm. The extracted speech features are then input into the trained deep neural network model for speech recognition.
  • Step 103 Input the voice features into a pre-built deep neural network model, and output the voice feature corresponding to the posterior probability vector of the syllable identification.
  • The deep neural network model includes the same number of syllable output units as the number of syllables in the pre-built pronunciation dictionary.
  • the pre-built deep neural network model is obtained by using all the syllables contained in the pre-built pronunciation dictionary to label the training data set and then training according to the deep learning algorithm.
  • the deep neural network model includes the same number of syllable output units as the number of syllables of the pre-built pronunciation dictionary.
  • All the syllables contained in the pre-built pronunciation dictionary can be all pronunciations collected according to pronunciation rules; for example, under Mandarin pronunciation rules based on the pronunciation of Pinyin letters, there are about 1,500 pronunciations of commonly used characters.
  • Each pronunciation is used as a syllable identifier, the speech data to be trained are labeled one by one, and the labeled data are then used as a training data set, where the syllable identifier is a category identifier used to identify the pronunciation of a character.
  • The deep neural network in the embodiment of the present application may be, for example, a Deep Neural Network (DNN), a Convolutional Neural Network (CNN), or a Long Short-Term Memory network (LSTM).
  • the pre-training of the deep neural network model may include the following steps:
  • the input of the deep neural network model is the voice feature of each voice frame
  • the output of each syllable output unit is the posterior probability value of each voice feature relative to the syllable identifier corresponding to that syllable output unit.
  • The posterior probability vector output by the above deep neural network model includes as many posterior probability values as there are syllable identifiers in the pronunciation dictionary. For example, if the input voice data includes "happy" (syllables "kai" and "xin"), the posterior probability values of those syllables are output along with those of every other syllable identifier contained in the pronunciation dictionary. Taking a pronunciation dictionary storing 1,500 pronunciations as an example, after each voice feature is input to the deep neural network, the output posterior probability vector has 1,500 dimensions, and each dimension corresponds to one syllable identifier in the pronunciation dictionary.
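  • As a concrete illustration, the following is a minimal sketch in Python of an acoustic model with one output unit per dictionary syllable; the feed-forward architecture and layer sizes are assumptions for illustration, since the patent does not specify them.

```python
# A minimal sketch (assumed architecture and sizes, not the patent's actual model) of an
# acoustic model whose output layer has one unit per syllable in the pronunciation dictionary.
import torch
import torch.nn as nn

NUM_SYLLABLES = 1500  # assumed size of the pronunciation dictionary
FEATURE_DIM = 40      # assumed FilterBank feature dimension per voice frame

class SyllableAcousticModel(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEATURE_DIM, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, NUM_SYLLABLES),  # one syllable output unit per syllable identifier
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, FEATURE_DIM); returns (batch, NUM_SYLLABLES) posterior vectors
        return torch.softmax(self.net(frames), dim=-1)
```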
  • Step 105 Determine the target probability vector from the posterior probability vector according to the syllable combination sequence.
  • the syllable combination sequence is constructed based on the input wake word text.
  • the target probability vector includes a posterior probability value corresponding to each text element (for example, character, word, phrase, etc.) in the wake word text determined according to the posterior probability vector. Then the confidence is calculated according to the target probability vector, and when the confidence is greater than or equal to the threshold, it is determined that the speech frame contains the wake word text.
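  • A minimal sketch of this extraction step, with assumed function and variable names, is shown below: only the posterior entries whose syllable identifiers appear in the syllable combination sequence are kept.

```python
# A minimal sketch, with assumed names, of determining the target probability vector from the
# posterior probability vector according to the syllable combination sequence.
import numpy as np

def extract_target_vector(posterior: np.ndarray, syllable_ids: list) -> np.ndarray:
    """posterior: (num_syllables,) posterior vector for one voice frame;
    syllable_ids: syllable identifiers in the wake-up word's syllable combination sequence."""
    return posterior[np.array(syllable_ids)]
```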
  • FIG. 5 shows a schematic diagram of a wake word text input interface provided by an embodiment of the present application.
  • the user can modify the wake word text arbitrarily in this interface.
  • Each character contained in the wake-up word text is converted into a syllable identifier by searching the pronunciation dictionary; the mapping relationship between the syllable identifiers and the characters contained in the wake-up word text is constructed, and the mapping relationship is used as a syllable combination sequence.
  • FIG. 6 shows a schematic diagram of the syllable combination sequence provided by an embodiment of the present application.
  • the wake word text is Chinese
  • each Chinese character is a character
  • the pronunciation of each character corresponds to a syllable identifier.
  • The pronunciation of the character shown in FIG. 6 can be the third or the fourth tone, and each pronunciation is assigned an identifier (Identifier) to serve as a syllable identifier.
  • the wake word text is "You are so happy”
  • the converted syllable combination sequence is {ID_n1, ID_n2, ID_n3, ID_n4, ID_n5}.
  • The syllable identifier can also be set during processing by default selection rules; for example, the syllable identifier corresponding to a polyphonic character is determined according to the semantic relationship.
  • Constructing a syllable combination sequence according to the inputted wake word text can include the following steps:
  • the mapping relationship between the syllable identifiers and the characters contained in the wake-up word text is constructed, and the mapping relationship is used as a syllable combination sequence.
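  • As an illustration, the following is a minimal sketch with a hypothetical pronunciation-dictionary structure (the entries and identifier values are invented for illustration only); a polyphonic character would map to several candidate identifiers.

```python
# A minimal sketch, assuming a hypothetical pronunciation-dictionary structure, of constructing
# the syllable combination sequence for a wake-up word text.
PRONUNCIATION_DICT = {  # hypothetical entries for illustration only
    "你": [101],        # one syllable identifier per pronunciation
    "真": [102],
    "开": [103],
    "心": [104],
}

def build_syllable_sequence(wake_word_text: str) -> list:
    """Map each character of the wake-up word text to its candidate syllable identifiers;
    the ordered result is the mapping relationship used as the syllable combination sequence."""
    return [(ch, PRONUNCIATION_DICT[ch]) for ch in wake_word_text]
```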
  • The input of the wake-up word text can be implemented on the terminal device itself, as shown in Fig. 5, or on another terminal device.
  • In either case, the terminal device can implement the update operation of the wake-up word text.
  • the terminal device may communicate with the server or network side via a wireless or wired network, and provide the server with a custom wake-up word input by the user through the input device of the terminal device (such as a touch screen, a microphone, etc.).
  • the terminal device may receive the syllable combination sequence corresponding to the wake-up word from the server for the detection of the wake-up word.
  • the terminal device may provide the voice data to be recognized to the server, and obtain a wake word detection result from the server, and the detection result indicates whether the voice data to be recognized includes a user-defined wake word.
  • the wake word detection system may include:
  • the first terminal device is configured to use a preset pronunciation dictionary to construct at least one syllable combination sequence for the user-defined wake-up word text input through the user interface, and to provide the at least one syllable combination sequence to the second terminal device, where the pronunciation dictionary includes pronunciations corresponding to multiple text elements, and the syllable combination sequence is an ordered combination of multiple syllables corresponding to the multiple text elements of the wake-up word text;
  • the second terminal device is configured to obtain the voice data to be recognized, and extract the voice feature of each voice frame in the voice data to be recognized;
  • the second terminal device is further configured to input the voice features into a pre-built deep neural network model to obtain posterior probability vectors, and to determine a target probability vector from the posterior probability vector according to the syllable combination sequence, where the target probability vector includes a posterior probability value, determined according to the posterior probability vector, corresponding to each text element in the wake-up word text;
  • the confidence is calculated according to the target probability vector, and when the confidence is greater than or equal to a threshold, it is determined that the speech frame contains the wake word text.
  • In one implementation, a terminal device (for example, a smart home appliance such as a smart speaker) cooperates with other devices to detect the wake-up word.
  • The other devices here may be other terminal devices and/or servers.
  • The terminal device may communicate with a second terminal device (for example, a mobile phone or a tablet computer) through a wireless or wired network, and obtain the user-defined wake-up word input through the input device (for example, a touch screen or a microphone) of the second terminal device.
  • The terminal device can use the custom wake-up word to detect the wake-up word itself, provide the custom wake-up word to the server to obtain the corresponding syllable combination sequence for wake-up word detection, or provide the voice data to be recognized to the server and obtain the wake-up word detection result from the server.
  • Alternatively, the terminal device may obtain from the server, through a wireless or wired network, the syllable combination sequence corresponding to the custom wake-up word provided by the second terminal device (for example, a mobile phone or a tablet computer), or provide the voice data to be recognized to the server and obtain the wake-up word detection result from the server.
  • the second terminal device needs to provide the custom wake-up word input by the user through the input device of the second terminal device (such as a touch screen, a microphone, etc.) to the server in advance through the connection between the second terminal device and the server.
  • the server may determine the wake-up word set by the second terminal device to be used by the terminal device through the user identifier or the preset association relationship between the terminal devices.
  • the target probability vector is determined from the posterior probability vector according to the syllable combination sequence, where the target probability vector includes the same posterior probability value as the number of characters contained in the wake word text.
  • the target probability vector is extracted from the posterior probability vector according to the syllable identifier contained in the syllable combination sequence. If the syllable combination sequence contains polyphonic characters, the posterior probability values related to multiple syllable identifiers corresponding to the polyphonic characters can be calculated according to the processing method described in FIG. 4 to calculate the confidence.
  • When the user inputs and sets the wake-up word text, the user may choose the pronunciation (i.e., syllable identifier) of any polyphonic character contained in the wake-up word text.
  • The pronunciation (i.e., syllable identifier) of a polyphonic character contained in the wake-up word text can also be determined by the default determination rule of the system.
  • The wake-up word text can first be detected and analyzed to determine whether it contains polyphones.
  • The system's default polyphone handling is then applied.
  • The syllable identifiers of the polyphonic characters are determined, and after this polyphone processing is performed on the wake-up word text, a syllable combination sequence corresponding to the wake-up word text is constructed.
  • the target probability vector can be determined from the posterior probability vector according to the syllable combination sequence, and the confidence level can be calculated directly according to the target probability vector.
  • The posterior probability values corresponding to the syllable identifiers contained in the syllable combination sequence are obtained from the posterior probability vector, namely {P_IDn1, P_IDn2, P_IDn3, P_IDn4, P_IDn5}.
  • A 4-dimensional target probability vector is then obtained according to the processing method described in FIG. 4; the number of posterior probability values contained in the target probability vector is the same as the number of characters in the wake-up word text.
  • the confidence is calculated according to the target probability vector, and it is judged whether the confidence is greater than or equal to the set threshold. If it is greater than or equal, it is considered that the voice data contains the wake-up word text, and if it is less than, it is considered that the voice data does not contain the wake-up word text.
  • The confidence level can be calculated according to the following formula:

    confidence = ( prod_{i=1..n} max_{h_max <= k <= j} p'_ik )^(1/n)    (1)

  • where n is the number of output units of the deep neural network model involved in the calculation (one per wake-up word syllable), p'_ik is the smoothed posterior probability of the i-th output unit at the k-th frame, j is the index of the current frame, and h_max = max(1, j - w_max + 1) is the start of a sliding window ending at frame j.
  • w_max can be set as a number of frames; for example, w_max takes 100 frames.
  • the threshold is adjustable in order to balance the final wake-up performance.
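  • As an illustration, the following is a minimal sketch of the confidence computation under formula (1) as reconstructed above; the smoothing window size w_smooth is an assumed parameter not given in the text.

```python
# A minimal sketch of the smoothed-posterior confidence of formula (1): smooth each wake-word
# syllable's posterior over recent frames, take the per-syllable maxima within a sliding window
# of w_max frames, and combine them with a geometric mean.
import numpy as np

def confidence(target_probs: np.ndarray, j: int, w_max: int = 100, w_smooth: int = 30) -> float:
    """target_probs: (num_frames, n) processed posteriors of the n wake-word syllables;
    j: index of the current frame; w_smooth: assumed smoothing-window size in frames."""
    n = target_probs.shape[1]
    smoothed = np.stack([
        target_probs[max(0, k - w_smooth + 1): k + 1].mean(axis=0)  # p'_ik: smoothed posterior
        for k in range(j + 1)
    ])
    h_max = max(0, j - w_max + 1)                          # start of the sliding window
    per_syllable_max = smoothed[h_max: j + 1, :].max(axis=0)
    return float(np.prod(per_syllable_max) ** (1.0 / n))   # n-th root of the product
```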
  • The embodiments of the present application can also directly construct a syllable combination sequence from the wake-up word text entered by the user, determine the target probability vector from the posterior probability vector according to the syllable combination sequence, and determine whether there are polyphonic characters in the wake-up word text during the calculation of the confidence of the target probability vector.
  • Calculating the confidence level according to the target probability vector may include:
  • performing probability processing on each posterior probability value contained in the target probability vector;
  • according to the mapping relationship between the syllable identifiers contained in the syllable combination sequence and the characters contained in the wake-up word text, determining whether the wake-up word text contains polyphones;
  • when the wake-up word text does not contain polyphones, calculating the confidence level according to the target probability vector after probability processing;
  • when the wake-up word text contains polyphones, summing the target probability vector after probability processing according to the corresponding relationship of the polyphonic characters, and calculating the confidence level from the result.
  • The probability processing step and the polyphone determination step may occur simultaneously, or either step may be performed first.
  • In the embodiments of the present application, the speech data is recognized by constructing a deep neural network model covering all the syllables of the pronunciation dictionary; the posterior probability values corresponding to the syllable identifiers of the wake-up word text are then extracted from the recognition result, according to the pre-input wake-up word text, as the target probability vector, and after the confidence is calculated according to the target probability vector, the confidence is judged to determine whether the voice data contains the content corresponding to the wake-up word text.
  • the embodiment of the present application further proposes a method for optimizing the confidence determination step.
  • FIG. 3 shows a schematic flowchart of step 105 provided by an embodiment of the present application. As shown in Figure 3, this step may include:
  • Step 301 For each posterior probability value in the target probability vector, determine whether it is lower than its corresponding prior probability value; when it is, set the posterior probability value to 0; when it is not, leave the posterior probability value unchanged;
  • Step 302 Divide the processed posterior probability value by the corresponding prior probability value to obtain the processed target probability vector
  • Step 303 Perform smoothing processing on the processed target probability vector
  • Step 304 Calculate the confidence level according to the smoothed target probability vector.
  • The target probability vector is a set of posterior probability values whose number equals the number of characters in the wake-up word text, obtained after the pronunciation of any polyphonic character is determined according to the default rule or user selection, for example {P_IDn1, P_IDn2, P_IDn4, P_IDn5}.
  • the prior probability value of each syllable identification can be obtained through statistical analysis of the training data set. For example, according to the training data set used to train the deep neural network model, the prior probability distribution of all syllable output units can be obtained. Among them, the prior probability is used to characterize the probability that the syllable identifier corresponding to the syllable output unit appears in the training data set.
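  • A minimal sketch of this statistic, assuming integer syllable-identifier labels per training frame, might look as follows:

```python
# A minimal sketch of obtaining each syllable output unit's prior probability as its relative
# frequency among all labeled frames of the training data set (assumed label format).
import numpy as np

def estimate_priors(frame_labels: np.ndarray, num_syllables: int = 1500) -> np.ndarray:
    """frame_labels: (num_frames,) syllable identifier assigned to each labeled training frame."""
    counts = np.bincount(frame_labels, minlength=num_syllables).astype(float)
    return counts / counts.sum()  # prior probability of each syllable identifier
```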
  • the deep neural network model After the deep neural network model outputs the posterior probability vector, it is necessary to extract the target probability vector from the posterior probability vector according to the syllable combination sequence. Then, a posterior filtering process is performed on each posterior probability value in the target probability vector.
  • The posterior filtering process refers to comparing each extracted posterior probability with the corresponding prior probability: if it is lower than the prior probability, the posterior probability is set to zero; if it is not lower than the prior probability, the posterior probability is left unchanged.
  • Since, in the posterior probability vector output by the deep neural network model, syllable output units other than the current one may also receive a small probability (especially when the current content is noise), the above posterior filtering processing in the embodiment of the present application can effectively reduce the impact of this part of the probability distribution on wake-up performance, thereby optimizing the wake-up result.
  • The target probability vector after posterior filtering is then processed by division: each posterior probability value it contains is divided by the corresponding prior probability value to obtain the revised posterior probability value.
  • This is the prior processing step. The output posterior probability is usually correlated with the prior probability: syllables that appear more often in the training data receive larger posterior probabilities at prediction time, while syllables that appear less often receive smaller ones.
  • The embodiment provided in this application therefore divides each posterior probability by the corresponding prior probability and uses the result as the posterior probability value of the pronunciation syllable, which improves the robustness of the system and effectively improves wake-up performance for wake-up words containing low-frequency pronunciations.
  • In this way, posterior filtering is performed on each posterior probability value in the target probability vector to reduce the impact of other syllable output units on wake-up performance, and prior processing is performed on each filtered posterior probability value, which effectively optimizes wake-up detection performance and improves wake-up recognition accuracy. A sketch of these steps follows.
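```python
# A minimal sketch of steps 301-303: posteriors below their priors are zeroed (posterior
# filtering), the rest are divided by the priors (prior processing), and the processed target
# vectors are smoothed over recent frames. The priors are assumed to come from a statistic
# such as estimate_priors above.
import numpy as np

def posterior_filter_and_prior_divide(target: np.ndarray, priors: np.ndarray) -> np.ndarray:
    """target, priors: (n,) posterior and prior values of the n wake-word syllables."""
    filtered = np.where(target < priors, 0.0, target)  # step 301: posterior filtering
    return filtered / priors                           # step 302: prior processing

def smooth_over_window(history: np.ndarray) -> np.ndarray:
    """history: (w, n) processed target vectors of the last w frames; step 303: smoothing."""
    return history.mean(axis=0)
```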
  • FIG. 4 shows a schematic flowchart of step 105 according to another embodiment of the present application.
  • This step can include:
  • Step 401 Perform probability processing on each posterior probability value contained in the target probability vector
  • Step 402 According to the mapping relationship between the syllable identifiers contained in the syllable combination sequence and the characters contained in the wake-up word text, determine whether the wake-up word text contains polyphonic characters;
  • Step 403 When the wake-up word text does not contain polyphonic characters, calculate the confidence level according to the target probability vector after the probability processing;
  • Step 404 When the wake-up word text contains polyphonic characters, the target probability vector after probability processing is summed according to the corresponding relationship of the polyphonic characters;
  • Step 405 Calculate the confidence degree according to the target probability vector after the summation processing.
  • Step 401 comprises the same operations as steps 301 and 302; the difference from the method shown in FIG. 3 lies in the subsequent handling of polyphonic characters.
  • the target probability vector of the current frame after the above processing is averaged with the results of multiple frames in a certain time window before, that is, the processed target probability vector is smoothed to reduce the interference caused by noise.
  • calculate the confidence level according to formula (1).
  • The method described in FIG. 4 is the same as that described in FIG. 3 when there are no polyphonic characters.
  • Whether the wake-up word text contains polyphonic characters is determined according to the mapping relationship between the syllable identifiers contained in the syllable combination sequence and the characters contained in the wake-up word text.
  • For example, as shown in FIG. 6, one character corresponds to two syllable identifiers, indicating that there is a polyphonic character in the wake-up word text.
  • The result of determining whether there are polyphonic characters in the wake-up word text may also be realized by an indicator: for example, when it is determined that there are polyphonic characters in the wake-up word text, the indicator is set. After the indicator identifies the polyphonic characters, the confidence level can be calculated according to the method shown in FIG. 3, so as to determine whether the voice data to be recognized includes the wake-up word text.
  • Step 405 may also include: smoothing the target probability vector after the summation processing, and calculating the confidence according to the smoothed target probability vector.
  • Through this polyphone recognition processing, the embodiment of the application optimizes the performance of wake-up word detection and improves the accuracy of wake-up word detection.
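  • The polyphone summation of step 404 can be sketched as follows; the data shapes and names are assumptions for illustration.

```python
# A minimal sketch of step 404: the probability-processed posteriors of a polyphonic character's
# candidate syllable identifiers are summed, so the target vector again holds one value per
# character of the wake-up word text.
import numpy as np

def sum_polyphones(processed_probs: dict, char_to_ids: list) -> np.ndarray:
    """processed_probs: syllable identifier -> probability-processed posterior value;
    char_to_ids: for each character of the wake word, its candidate syllable identifiers."""
    return np.array([sum(processed_probs[i] for i in ids) for ids in char_to_ids])

# For example, char_to_ids = [[n1], [n2, n3], [n4], [n5]] (one polyphonic character) yields a
# 4-dimensional target probability vector for a 4-character wake-up word text.
```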
  • In a specific example, it is assumed that the user enters "You are so happy" as the wake-up word text through the wake-up word text input interface shown in FIG. 5.
  • Each character contained in the wake-up word text is converted into a syllable identifier; the mapping relationship between the syllable identifiers and the characters contained in the wake-up word text is constructed, and this mapping relationship is used as the syllable combination sequence, as shown in Figure 6.
  • After completing the above operations, the terminal device starts a detection program to detect the voice data (also called sound data) around it. After voice data input is detected, the voice data is pre-emphasized and framed with a frame length of 25 ms and a frame shift of 10 ms to obtain multiple voice frames. After applying a Hamming window, the voice feature corresponding to each voice frame of the data is extracted according to the FilterBank algorithm. A sketch of this front end follows.
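```python
# A minimal sketch of the described front end using the python_speech_features package (an
# assumed implementation choice; the patent names no library): pre-emphasis, 25 ms frames with
# a 10 ms shift, a Hamming window, and FilterBank features per voice frame.
import numpy as np
from python_speech_features import fbank

def extract_features(signal: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    feats, _energy = fbank(signal, samplerate=sample_rate,
                           winlen=0.025, winstep=0.01,  # 25 ms frame length, 10 ms shift
                           nfilt=40, preemph=0.97,      # 40 filters (assumed), pre-emphasis
                           winfunc=np.hamming)          # Hamming window per frame
    return np.log(feats)  # one log-FilterBank feature vector per voice frame
```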
  • the voice features are input to the pre-built deep neural network model, and the output voice features correspond to the posterior probability vector of the syllable identification.
  • The deep neural network model includes the same number of syllable output units as the number of syllables in the pre-built pronunciation dictionary. Assuming the number of syllables in the pronunciation dictionary is 1,500, the posterior probability vector can be expressed as {P_ID1, P_ID2, P_ID3, ..., P_IDn1, P_IDn2, ..., P_IDm}, where m is 1500; for example, P_ID1 represents the posterior probability value of the speech feature relative to syllable ID1.
  • According to the syllable combination sequence, the target probability vector can be a set of posterior probability values filtered according to the polyphone rule selected by the user, or a set of posterior probability values processed according to the system default rules.
  • The target probability vector can be expressed as {P_IDn1, P_IDn2, P_IDn4, P_IDn5}.
  • the confidence is calculated according to the target probability vector, and when the confidence is greater than or equal to the threshold, it is determined that the speech frame contains the wake word text.
  • The user can initiate a change of the wake-up word text at any time, for example changing it to "turn on"; referring to the above method, "turn on" is converted into syllable identifiers to obtain a new syllable combination sequence.
  • When voice data is detected, the target probability vector is extracted, according to the syllable combination sequence, from the posterior probability vector obtained by recognizing the speech data; each posterior probability value in the target probability vector is then processed in the manner of Fig. 3 or Fig. 4, the confidence is calculated according to the processed target probability vector, and whether the voice data contains the wake-up word text is determined according to the confidence. The terminal device is woken up when it is determined that the voice data contains the wake-up word text.
  • the wake word detection method provided by the embodiment of the present application has low computational complexity and can process the input frame by frame, so the method has a fast response speed.
  • FIG. 7 shows a schematic structural diagram of a wake word detection apparatus 700 provided by an embodiment of the present application.
  • the device 700 includes:
  • the voice feature extraction unit 701 is configured to obtain the voice data to be recognized, and extract the voice feature of each voice frame in the voice data to be recognized;
  • the voice feature recognition unit 702 is used to input the voice features into a pre-built deep neural network model and output, for each voice feature, a posterior probability vector over syllable identifiers, where the deep neural network model includes the same number of syllable output units as there are syllables in the pre-built pronunciation dictionary;
  • the confidence determination unit 703 is configured to determine the target probability vector from the posterior probability vector according to the syllable combination sequence, and calculate the confidence based on the target probability vector, and when the confidence is greater than or equal to the threshold, determine that the speech frame contains the wake word text.
  • the syllable combination sequence is constructed based on the input wake word text.
  • FIG. 8 shows a schematic structural diagram of a confidence determination unit 703 provided in an embodiment of the present application.
  • the confidence determination unit 703 further includes:
  • the posterior filtering subunit 801 is used to set a posterior probability value contained in the target probability vector to 0 when it is lower than its corresponding prior probability value; otherwise, the posterior probability value is left unchanged;
  • the a priori processing subunit 802 is configured to divide each posterior probability value after the above processing by the corresponding prior probability value to obtain the processed target probability vector;
  • the first smoothing subunit 803 is configured to perform smoothing processing on the processed target probability vector
  • the first confidence calculation subunit 804 is configured to calculate the confidence based on the smoothed target probability vector.
  • Fig. 9 shows a schematic structural diagram of a confidence determination unit 703 according to another embodiment of the present application.
  • the confidence determination unit 703 further includes:
  • the probability processing subunit 901 is configured to perform probability processing on each posterior probability value contained in the target probability vector;
  • the polyphonic character determination subunit 902 is used to determine whether the wake-up word text contains polysyllabic characters according to the mapping relationship between the syllable identifier contained in the syllable combination sequence and the characters contained in the wake-up word text;
  • the first confidence calculation subunit 903 is configured to calculate the confidence based on the target probability vector after the probability processing when the wake-up word text does not contain polyphonic characters.
  • the confidence determination unit 703 further includes:
  • the second confidence calculation subunit 904 is used, when the wake-up word text contains polyphonic characters, for summing the target probability vector after probability processing according to the corresponding relationship of the polyphones, and calculating the confidence according to the target probability vector after the summation processing.
  • the probability processing subunit 901 may also include:
  • the posterior filtering module is used to set the posterior probability value to 0 when each posterior probability value contained in the target probability vector is lower than its corresponding prior probability value; otherwise, the posterior probability value is not processed;
  • the priori processing module is used to divide each posterior probability value after the above-mentioned processing by its corresponding prior probability value to obtain the processed target probability vector.
  • the first confidence calculation subunit 903 may further include:
  • the smoothing processing module is used for smoothing the target probability vector after probability processing;
  • the confidence calculation module is used to calculate the confidence based on the smoothed target probability vector.
  • the second confidence calculation subunit 904 may further include:
  • the probability summation module is used for summing the target probability vector after probability processing according to the corresponding relationship of the polyphonic characters
  • the smoothing processing module is used to smooth the target probability vector after the summation processing
  • the confidence calculation module is used to calculate the confidence based on the smoothed target probability vector.
  • the apparatus 700 may further include a network construction unit (not shown) for:
  • the input of the deep neural network model is the speech feature of each speech frame
  • the output of each syllable output unit is the posterior probability value of each voice feature relative to the syllable identifier corresponding to that syllable output unit.
  • the apparatus 700 may be implemented in a browser or other security application of an electronic device in advance, and may also be loaded into the browser of the electronic device or its security application by downloading or the like. Corresponding units in the apparatus 700 can cooperate with units in the electronic device to implement the solutions of the embodiments of the present application.
  • the embodiment of the present application also provides a wake word detection system.
  • FIG. 10 shows a wake word detection system 1000 provided by an embodiment of the present application.
  • The system 1000 includes a voice recognition unit 1001 and a wake-up word setting unit 1002.
  • the voice recognition unit 1001 can be set in the first terminal
  • the wake-up word setting unit 1002 can be set in the second terminal.
  • the first terminal and the second terminal may be connected in a wired or wireless manner.
  • the first terminal may be, for example, a wireless speaker
  • the second terminal may be, for example, a device such as a mobile phone or a tablet.
  • the speech recognition unit 1001 may include a structure as shown in FIG. 7.
  • the voice feature extraction unit is used to obtain the voice data to be recognized, and extract the voice feature of each voice frame in the voice data to be recognized;
  • the voice feature recognition unit is used to input the voice features into a pre-built deep neural network model and output, for each voice feature, a posterior probability vector over syllable identifiers, where the deep neural network model includes the same number of syllable output units as there are syllables in the pre-built pronunciation dictionary;
  • the confidence determination unit is used to determine the target probability vector from the posterior probability vector according to the syllable combination sequence, and calculate the confidence based on the target probability vector, and when the confidence is greater than or equal to the threshold, it is determined that the speech frame contains the wake word text.
  • the syllable combination sequence is constructed based on the input wake-up word text.
  • The wake-up word setting unit 1002 is used to obtain the input wake-up word text, convert each character contained in the wake-up word text into a syllable identifier by searching the pronunciation dictionary, construct the mapping relationship between the syllable identifiers and the characters contained in the wake-up word text, and regard the mapping relationship as the syllable combination sequence.
  • the aforementioned voice recognition unit and wake word setting unit provided in this application can also be implemented in the same terminal.
  • FIG. 11 shows a schematic structural diagram of a terminal device or a server 1100 suitable for implementing embodiments of the present application.
  • The terminal device or server 1100 includes a central processing unit (CPU) 1101, which can execute various appropriate actions and processing according to a program stored in a read-only memory (ROM) 1102 or a program loaded from a storage portion 1108 into a random access memory (RAM) 1103.
  • the CPU 1101, the ROM 1102, and the RAM 1103 are connected to each other through a bus 1104.
  • An input/output (I/O) interface 1105 is also connected to the bus 1104.
  • The following components are connected to the I/O interface 1105: an input portion 1106 including a keyboard, a mouse, etc.; an output portion 1107 including a cathode ray tube (CRT), a liquid crystal display (LCD), speakers, etc.; a storage portion 1108 including a hard disk, etc.; and a communication portion 1109 including a network interface card such as a LAN card or a modem. The communication portion 1109 performs communication processing via a network such as the Internet.
  • A drive 1110 is also connected to the I/O interface 1105 as needed.
  • a removable medium 1111 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is installed on the drive 1110 as needed, so that the computer program read therefrom is installed into the storage portion 1108 as needed.
  • an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a machine-readable medium, and the computer program contains program code for executing the method shown in the flowchart.
  • the computer program may be downloaded and installed from the network through the communication part 1109, and/or installed from the removable medium 1111.
  • the computer-readable medium shown in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two.
  • The computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in a baseband or as a part of a carrier wave, and a computer-readable program code is carried therein. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium.
  • the computer-readable medium may send, propagate, or transmit the program for use by or in combination with the instruction execution system, apparatus, or device .
  • the program code contained on the computer-readable medium can be transmitted by any suitable medium, including but not limited to: wireless, wire, optical cable, RF, etc., or any suitable combination of the above.
  • Each block in the flowchart or block diagram may represent a module, program segment, or part of code, which contains one or more executable instructions for realizing the specified logical function.
  • It should also be noted that the functions marked in the blocks may occur in an order different from that marked in the drawings; for example, two blocks shown in succession may actually be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved.
  • each block in the block diagram and/or flowchart, and combinations of blocks in the block diagram and/or flowchart, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the units or modules involved in the embodiments described in the present application can be implemented in software or hardware.
  • the described unit or module may also be provided in the processor, for example, it may be described as: a processor includes a preprocessing module, a receiving module, a selection generating module, and a sending module.
  • the names of these units or modules do not, under certain circumstances, constitute a limitation on the units or modules themselves.
  • for example, the preprocessing module can also be described as "a unit configured to pre-assign a virtual identifier, a first identifier, and at least one second identifier to the first client".
  • the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be included in the electronic device described in the above-mentioned embodiments, or it may exist separately without being assembled into the electronic device.
  • the aforementioned computer-readable storage medium stores one or more programs that, when executed by one or more processors, perform the methods described in this application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Telephonic Communication Services (AREA)
  • Machine Translation (AREA)

Abstract

An artificial intelligence-based wakeup word detection method, apparatus (700), device, and storage medium. The method comprises: constructing at least one syllable combination sequence, by using a preset pronunciation dictionary, for a user-defined wakeup word text entered by a user through a user interface; acquiring to-be-recognized speech data, and extracting a speech feature of each speech frame in the to-be-recognized speech data (101); inputting the speech feature into a pre-constructed deep neural network model, and outputting a posterior probability vector of the speech feature with respect to syllable identifiers, the deep neural network model comprising syllable output units equal in number to the syllables of the pre-constructed pronunciation dictionary (103); and determining a target probability vector from the posterior probability vector according to the syllable combination sequence, computing a confidence according to the target probability vector, and determining, when the confidence is greater than or equal to a threshold, that the speech frame contains the wakeup word text, wherein the syllable combination sequence is constructed according to the entered wakeup word text (105).

Description

Artificial intelligence-based wakeup word detection method, apparatus, device, and medium
This application claims priority to Chinese Patent Application No. 201911124453.4, entitled "Artificial intelligence-based wakeup word detection method, apparatus, device, and medium" and filed with the China National Intellectual Property Administration on November 14, 2019, which is incorporated herein by reference in its entirety.
Technical Field
This application relates generally to the technical field of speech recognition, and in particular to an artificial intelligence-based wakeup word detection method, apparatus, device, and medium.
Background of the Invention
The key technologies of speech technology include automatic speech recognition (ASR), text-to-speech (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and speech is regarded as one of the most promising modes of human-computer interaction in the future. Applying speech technology to an electronic device to implement the function of waking up the electronic device is known as voice wakeup. Keyword spotting (KWS) usually works by setting a fixed wakeup word: only after the user utters the wakeup word does the speech recognition function on the terminal enter the working state; otherwise it remains dormant. For example, a recognition result is output by an acoustic model built on a deep neural network, where the acoustic model is trained on the syllables or phonemes corresponding to the fixed wakeup word, so it does not support modification of the wakeup word.
To meet users' demand for customizing the wakeup word, custom wakeup solutions also exist in the prior art, for example, a custom wakeup word solution based on the hidden Markov model (HMM). Such a solution includes two parts, an acoustic model and an HMM decoding network. During wakeup word detection, speech is fed into the decoding network in windows of a fixed size, and the Viterbi decoding algorithm is then used to find the optimal decoding path.
Summary of the Invention
In view of the above defects or deficiencies in the prior art, it is desirable to provide an artificial intelligence-based wakeup word detection method, apparatus, device, and medium that satisfy users' demand for customizing the wakeup word while having low computational complexity and fast response.
An embodiment of this application provides an artificial intelligence-based wakeup word detection method, the method comprising:
constructing at least one syllable combination sequence, by using a preset pronunciation dictionary, for a user-defined wakeup word text entered by a user through a user interface, the pronunciation dictionary containing pronunciations respectively corresponding to a plurality of text elements, and the syllable combination sequence being an ordered combination of a plurality of syllables corresponding to a plurality of text elements of the wakeup word text;
acquiring to-be-recognized speech data, and extracting a speech feature of each speech frame in the to-be-recognized speech data;
inputting the speech feature into a pre-constructed deep neural network model, and outputting a posterior probability vector of the speech feature with respect to syllable identifiers, the deep neural network model comprising syllable output units equal in number to the syllables of the pronunciation dictionary;
determining a target probability vector from the posterior probability vector according to the syllable combination sequence, the target probability vector comprising posterior probability values, determined from the posterior probability vector, corresponding to the text elements in the wakeup word text; and
computing a confidence according to the target probability vector, and determining, when the confidence is greater than or equal to a threshold, that the speech frame contains the wakeup word text.
An embodiment of this application provides an artificial intelligence-based wakeup word detection apparatus, the apparatus comprising:
a wakeup word setting unit, configured to construct at least one syllable combination sequence, by using a preset pronunciation dictionary, for a user-defined wakeup word text entered by a user through a user interface, the pronunciation dictionary containing pronunciations respectively corresponding to a plurality of text elements, and the syllable combination sequence being an ordered combination of a plurality of syllables corresponding to a plurality of text elements of the wakeup word text;
a speech feature extraction unit, configured to acquire to-be-recognized speech data and extract a speech feature of each speech frame in the to-be-recognized speech data;
a speech feature recognition unit, configured to input the speech feature into a pre-constructed deep neural network model and output a posterior probability vector of the speech feature with respect to syllable identifiers, the deep neural network model comprising syllable output units equal in number to the syllables of a pre-constructed pronunciation dictionary; and
a confidence decision unit, configured to determine a target probability vector from the posterior probability vector according to the syllable combination sequence, the target probability vector comprising posterior probability values, determined from the posterior probability vector, corresponding to the text elements in the wakeup word text; compute a confidence according to the target probability vector; and determine, when the confidence is greater than or equal to a threshold, that the speech frame contains the wakeup word text.
An embodiment of this application provides a computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method described in the embodiments of this application.
An embodiment of this application provides an artificial intelligence-based wakeup word detection system, comprising:
a first terminal device, configured to construct at least one syllable combination sequence, by using a preset pronunciation dictionary, for a user-defined wakeup word text entered by a user through a user interface, and provide the at least one syllable combination sequence to a second terminal device, the pronunciation dictionary containing pronunciations respectively corresponding to a plurality of text elements, and the syllable combination sequence being an ordered combination of a plurality of syllables corresponding to a plurality of text elements of the wakeup word text; and
a second terminal device, configured to acquire to-be-recognized speech data and extract a speech feature of each speech frame in the to-be-recognized speech data;
input the speech feature into a pre-constructed deep neural network model, and output a posterior probability vector of the speech feature with respect to syllable identifiers, the deep neural network model comprising syllable output units equal in number to the syllables of the pronunciation dictionary;
determine a target probability vector from the posterior probability vector according to the syllable combination sequence, the target probability vector comprising posterior probability values, determined from the posterior probability vector, corresponding to the text elements in the wakeup word text; and
compute a confidence according to the target probability vector, and determine, when the confidence is greater than or equal to a threshold, that the speech frame contains the wakeup word text.
An embodiment of this application provides a computer-readable storage medium on which a computer program is stored, wherein:
the computer program, when executed by a processor, implements the method described in the embodiments of this application.
In the artificial intelligence-based wakeup word detection method, apparatus, device, and medium provided in the embodiments of this application, speech data is recognized by a deep neural network model constructed to cover all syllables of the pronunciation dictionary; posterior probability values corresponding to the syllable identifiers of a pre-entered wakeup word text are then extracted from the recognition result as a target probability vector according to the wakeup word text; a confidence is computed according to the target probability vector; and a decision is made on the confidence to determine whether the speech data contains content corresponding to the wakeup word text. The method provided in the embodiments of this application has low computational complexity and a fast response speed, requires no dedicated optimization or improvement for a fixed wakeup word, and effectively improves wakeup detection efficiency.
Brief Description of the Drawings
Other features, objects, and advantages of this application will become more apparent upon reading the following detailed description of non-limiting embodiments made with reference to the accompanying drawings:
FIG. 1 is a schematic diagram of a wakeup word application scenario according to an embodiment of this application;
FIG. 2 is a schematic flowchart of a wakeup word detection method according to an embodiment of this application;
FIG. 3 is a schematic flowchart of step 105 according to an embodiment of this application;
FIG. 4 is a schematic flowchart of step 105 according to another embodiment of this application;
FIG. 5 is a schematic diagram of a wakeup word text input interface according to an embodiment of this application;
FIG. 6 is a schematic diagram of a syllable combination sequence according to an embodiment of this application;
FIG. 7 is a schematic structural diagram of a wakeup word detection apparatus 700 according to an embodiment of this application;
FIG. 8 is a schematic structural diagram of a confidence decision unit 703 according to an embodiment of this application;
FIG. 9 is a schematic structural diagram of a confidence decision unit 703 according to another embodiment of this application;
FIG. 10 shows a wakeup word detection system 1000 according to an embodiment of this application; and
FIG. 11 is a schematic structural diagram of a terminal device or server suitable for implementing the embodiments of this application.
Modes for Carrying Out the Invention
This application is further described in detail below with reference to the accompanying drawings and embodiments. It may be understood that the specific embodiments described herein are merely used to explain the related disclosure rather than to limit the disclosure. It should also be noted that, for ease of description, only parts related to the disclosure are shown in the drawings.
Note that the embodiments in this application and the features in the embodiments may be combined with each other in the case of no conflict. This application is described in detail below with reference to the accompanying drawings and in combination with the embodiments.
FIG. 1 is a schematic diagram of a wakeup word application scenario according to an embodiment of this application. As shown in FIG. 1, a terminal device automatically detects speech data in its vicinity and recognizes from the speech data whether a wakeup word is present. When a wakeup word is present, the terminal device can switch from a not-fully-working state (for example, a dormant state) to a working state, thereby waking up the terminal device so that it works normally.
The terminal device may be a mobile phone, a tablet, a notebook computer, a wireless speaker, an intelligent robot, a smart household appliance, or the like. It may be a fixed terminal or a mobile terminal.
The terminal device may include at least a speech receiving apparatus, where the speech receiving apparatus can receive sound data output by a user and process the sound data to obtain data that can be recognized and analyzed.
The terminal device may further include other apparatuses, for example, a processing apparatus configured to perform intelligent recognition processing on the speech data. During the intelligent recognition processing, if the speech data contains a preset wakeup word, the terminal device is woken up. The above voice wakeup technology can reduce the power consumption of the terminal and save electricity. Further, a wakeup operation may also be performed, through the voice wakeup technology, on an application pre-installed in the terminal device, so that the application is started conveniently and quickly, reducing the operating procedures of the terminal operating system.
In a scenario where voice wakeup is implemented based on a deep neural network, the deep neural network is trained and constructed according to a fixed wakeup word. When speech data is detected around the terminal device, the terminal device can extract a speech feature of the speech data, input the speech feature into the deep neural network model, and then output the posterior probabilities of the speech feature with respect to the label classes of the fixed wakeup word. A confidence is computed according to the posterior probabilities, and the confidence is used to judge whether the speech data contains the fixed wakeup word. In the above method, however, modifying the wakeup word requires retraining the deep neural network, so the user cannot change the wakeup word at will.
To solve the above problem, this application proposes a new artificial intelligence-based wakeup word detection method.
Artificial intelligence (AI) is a theory, method, technology, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, AI is a comprehensive technology of computer science; it attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. AI is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
AI technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic AI technologies generally include technologies such as sensors, dedicated AI chips, cloud computing, distributed storage, big data processing technologies, operating/interaction systems, and mechatronics. AI software technologies mainly include several major directions such as computer vision, speech processing, natural language processing, and machine learning/deep learning.
Refer to FIG. 2, which is a schematic flowchart of a wakeup word detection method according to an embodiment of this application. The method may be executed by a processor of one or more computing devices, where the one or more computing devices may be one or more terminal devices and/or one or more servers. As shown in FIG. 2, the method may include the following steps.
The method in each embodiment allows the user to customize the wakeup word. Before the user-defined wakeup word is detected, at least one syllable combination sequence may be constructed, by using a preset pronunciation dictionary, for the user-defined wakeup word text entered by the user through a user interface. The pronunciation dictionary contains pronunciations respectively corresponding to a plurality of text elements. The syllable combination sequence is an ordered combination of a plurality of syllables corresponding to a plurality of text elements of the wakeup word text. After the syllable combination sequence corresponding to the user-defined wakeup word is constructed, the user-defined wakeup word can be detected.
Step 101: Acquire to-be-recognized speech data, and extract a speech feature of each speech frame in the to-be-recognized speech data.
In this step, speech data around the terminal device may be monitored in real time or at regular intervals, or the speech data may be acquired after a voice wakeup trigger instruction is received. After speech data is detected, a speech feature of the speech data is extracted, for example, by extracting a speech feature for each speech frame of the speech data according to the FilterBank algorithm. The extracted speech feature is then input into a trained deep neural network model for speech recognition.
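For illustration only, the following Python sketch shows one plausible FilterBank front end of the kind this step refers to. The sample rate, the 25 ms frame length and 10 ms frame shift, the Hamming window, and the 40 mel filters are assumptions drawn from the worked example later in this description and from common practice, not a specification of the actual implementation:

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, sr):
    """Standard triangular mel filters; returns (n_mels, n_fft//2 + 1)."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def logfbank_features(signal, sr=16000, frame_ms=25, shift_ms=10,
                      n_fft=512, n_mels=40, preemph=0.97):
    """Pre-emphasis, framing, Hamming window, log mel FilterBank energies."""
    signal = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    frame_len, frame_shift = sr * frame_ms // 1000, sr * shift_ms // 1000
    assert len(signal) >= frame_len, "need at least one full frame"
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    window, fb = np.hamming(frame_len), mel_filterbank(n_mels, n_fft, sr)
    feats = []
    for t in range(n_frames):
        frame = signal[t * frame_shift : t * frame_shift + frame_len] * window
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2 / n_fft
        feats.append(np.log(fb @ power + 1e-10))
    return np.stack(feats)  # shape: (n_frames, n_mels)
```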
Step 103: Input the speech feature into a pre-constructed deep neural network model, and output a posterior probability vector of the speech feature with respect to syllable identifiers, where the deep neural network model includes syllable output units equal in number to the syllables of a pre-constructed pronunciation dictionary.
In this step, the pre-constructed deep neural network model is obtained by labeling a training data set with all the syllables contained in the pre-constructed pronunciation dictionary and then training according to a deep learning algorithm. The deep neural network model includes syllable output units equal in number to the syllables of the pre-constructed pronunciation dictionary. All the syllables contained in the pre-constructed pronunciation dictionary may be all pronunciations that can be collected according to pronunciation rules; for example, under Mandarin pronunciation rules, with pronunciations formed from Pinyin letter combinations, there are roughly 1,500 pronunciations of commonly used characters. The pronunciation of each character is used as a syllable identifier, and the to-be-trained speech data is labeled piece by piece to form the training data set, where the syllable identifier is a class identifier used to identify the pronunciation of the character.
The deep neural network in the embodiments of this application may be, for example, a deep neural network (DNN), a convolutional neural network (CNN), a long short-term memory network (LSTM), or the like.
Pre-training the deep neural network model may include the following steps:
acquiring a to-be-trained speech data set;
labeling each piece of speech data in the speech data set according to the syllables contained in the pronunciation dictionary, to obtain a training data set; and
training a deep neural network by using the training data set, to obtain the deep neural network model, where the input of the deep neural network model is the speech feature of each speech frame, and each syllable output unit outputs the posterior probability value of each speech feature with respect to the syllable identifier corresponding to that syllable output unit.
The posterior probability vector output by the above deep neural network model includes posterior probability values equal in number to the syllable identifiers contained in the pronunciation dictionary. For example, if the input speech data contains "开心" (kai xin), each of the syllables "开" and "心" has a posterior probability value with respect to each syllable identifier contained in the pronunciation dictionary. Taking a pronunciation dictionary storing 1,500 pronunciations as an example, after each speech feature is input into the deep neural network, the output posterior probability vector is 1,500-dimensional, with each dimension corresponding to one syllable identifier in the pronunciation dictionary.
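As a minimal sketch of the shape of this computation only (random weights stand in for a trained model; the 1,500-unit output layer and the 40-dimensional input are the assumptions stated above), a frame-level posterior could be produced as follows:

```python
import numpy as np

N_SYLLABLES = 1500  # assumed size of the pronunciation dictionary
N_FEATS = 40        # matches the filterbank dimension in the sketch above

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0.0, 0.01, (N_FEATS, 256)), np.zeros(256)
W2, b2 = rng.normal(0.0, 0.01, (256, N_SYLLABLES)), np.zeros(N_SYLLABLES)

def syllable_posteriors(feat):
    """Map one frame's feature vector to a posterior over all syllable IDs."""
    h = np.maximum(feat @ W1 + b1, 0.0)    # ReLU hidden layer
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())      # numerically stable softmax
    return e / e.sum()                     # shape: (N_SYLLABLES,), sums to 1
```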
Step 105: Determine a target probability vector from the posterior probability vector according to the syllable combination sequence, where the syllable combination sequence is constructed according to the entered wakeup word text, and the target probability vector includes posterior probability values, determined from the posterior probability vector, corresponding to the text elements (for example, characters, words, or phrases) in the wakeup word text. Then compute a confidence according to the target probability vector, and determine, when the confidence is greater than or equal to a threshold, that the speech frame contains the wakeup word text.
In this step, the syllable combination sequence is constructed according to the entered wakeup word text. FIG. 5 is a schematic diagram of a wakeup word text input interface according to an embodiment of this application; the user can arbitrarily modify the wakeup word text in this interface. After the entered wakeup word text is acquired, each character contained in the wakeup word text is converted into a syllable identifier by looking up the pronunciation dictionary, and a mapping relationship between the syllable identifiers and the characters contained in the wakeup word text is constructed, the mapping relationship serving as the syllable combination sequence. The syllable combination sequence is shown in FIG. 6, which is a schematic diagram of a syllable combination sequence according to an embodiment of this application; it includes the characters contained in the wakeup word text and the syllable identifiers corresponding to those characters. If the wakeup word text is Chinese, each Chinese character is one character, and each pronunciation of a character corresponds to a syllable identifier. For example, the character "好" shown in FIG. 6 may be pronounced in either the third tone or the fourth tone, and each pronunciation is assigned an identifier (ID) serving as a syllable identifier. As shown in FIG. 6, the wakeup word text is "你好开心" (ni hao kai xin), and the converted syllable combination sequence is {ID n1, ID n2, ID n3, ID n4, ID n5}. In some embodiments, after the wakeup word text entered by the user is received on the wakeup word text input interface shown in FIG. 5, if the wakeup word text is recognized as containing a polyphonic character, the user is prompted to confirm the pronunciation of the polyphonic character so as to determine the syllable identifier corresponding to the polyphonic character; alternatively, a default selection rule may be set during processing, for example, determining the syllable identifier corresponding to a polyphonic character according to semantic relationships.
Constructing the syllable combination sequence according to the entered wakeup word text may include the following steps (a minimal sketch follows the list):
acquiring the entered wakeup word text;
converting each character contained in the wakeup word text into a syllable identifier by looking up the pronunciation dictionary; and
constructing a mapping relationship between the syllable identifiers and the characters contained in the wakeup word text, the mapping relationship serving as the syllable combination sequence.
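A minimal sketch of these three steps, assuming a toy pronunciation dictionary and hypothetical numeric syllable IDs:

```python
# Toy pronunciation dictionary; entries and numeric IDs are hypothetical.
PRONUNCIATION_DICT = {
    "你": ["ni3"],
    "好": ["hao3", "hao4"],  # polyphonic character: two pronunciations
    "开": ["kai1"],
    "心": ["xin1"],
}
SYLLABLE_ID = {"ni3": 101, "hao3": 102, "hao4": 103, "kai1": 104, "xin1": 105}

def build_syllable_sequence(wakeup_text):
    """Map each character of the wakeup word text to its syllable ID(s)."""
    sequence = []
    for ch in wakeup_text:
        candidates = PRONUNCIATION_DICT[ch]  # unknown characters not handled
        sequence.append((ch, [SYLLABLE_ID[s] for s in candidates]))
    return sequence

# "你好开心" yields five IDs over four characters, as in FIG. 6:
# [('你', [101]), ('好', [102, 103]), ('开', [104]), ('心', [105])]
```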
In the above embodiment, the entry of the wakeup word text may be implemented on the terminal device, as shown in FIG. 5, or may be implemented through another terminal device.
In some embodiments, a terminal device (such as a mobile phone, a smart appliance, or a smart speaker) can implement the update operation of the wakeup word text. For example, the terminal device may communicate with a server on the network side over a wireless or wired network, and provide the server with the user-defined wakeup word entered by the user through an input apparatus (for example, a touchscreen or a sound pickup) of the terminal device. In some examples, the terminal device may receive from the server the syllable combination sequence corresponding to the wakeup word for wakeup word detection. In other examples, the terminal device may provide the to-be-recognized speech data to the server and obtain a wakeup word detection result from the server, the detection result indicating whether the to-be-recognized speech data includes the user-defined wakeup word.
In other embodiments, the wakeup word detection system may include:
a first terminal device, configured to construct at least one syllable combination sequence, by using a preset pronunciation dictionary, for a user-defined wakeup word text entered by a user through a user interface, and provide the at least one syllable combination sequence to a second terminal device, the pronunciation dictionary containing pronunciations respectively corresponding to a plurality of text elements, and the syllable combination sequence being an ordered combination of a plurality of syllables corresponding to a plurality of text elements of the wakeup word text; and
a second terminal device, configured to acquire to-be-recognized speech data and extract a speech feature of each speech frame in the to-be-recognized speech data;
input the speech feature into a pre-constructed deep neural network model, and output a posterior probability vector of the speech feature with respect to syllable identifiers, the deep neural network model including syllable output units equal in number to the syllables of the pronunciation dictionary;
determine a target probability vector from the posterior probability vector according to the syllable combination sequence, the target probability vector including posterior probability values, determined from the posterior probability vector, corresponding to the text elements in the wakeup word text; and
compute a confidence according to the target probability vector, and determine, when the confidence is greater than or equal to a threshold, that the speech frame contains the wakeup word text.
For example, a terminal device (for example, a smart appliance such as a smart speaker) may implement the update operation of the wakeup word text through another device connected to the speaker in a wireless or wired manner. The other device here may also be another terminal device and/or a server. In some examples, the terminal device may communicate with a second terminal device (for example, a mobile phone or tablet) over a wireless or wired network to obtain the user-defined wakeup word entered by the user through an input apparatus (for example, a touchscreen or a sound pickup) of the second terminal device. The terminal device may use the user-defined wakeup word for wakeup word detection; or provide the user-defined wakeup word to a server so as to obtain the syllable combination sequence corresponding to the wakeup word for wakeup word detection; or provide the to-be-recognized speech data to the server and obtain the wakeup word detection result from the server. In other examples, the terminal device may obtain, from the server over a wireless or wired network, the syllable combination sequence corresponding to the user-defined wakeup word provided by the second terminal device (for example, a mobile phone or tablet), or provide the to-be-recognized speech data to the server and obtain the wakeup word detection result from the server. In this case, the second terminal device needs to provide, in advance, the user-defined wakeup word entered by the user through the input apparatus (for example, a touchscreen or a sound pickup) of the second terminal device to the server through the connection between the second terminal device and the server. The server may determine, through a user identifier or a preset association relationship between terminal devices, which second terminal device's wakeup word setting the terminal device adopts.
In this embodiment of this application, a target probability vector is determined from the posterior probability vector according to the syllable combination sequence, where the target probability vector includes posterior probability values equal in number to the characters contained in the wakeup word text. After the deep neural network model outputs the posterior probability vector, the target probability vector is extracted from the posterior probability vector according to the syllable identifiers contained in the syllable combination sequence. If the syllable combination sequence contains a polyphonic character, the confidence can be computed from the posterior probability values related to the multiple syllable identifiers corresponding to the polyphonic character according to the processing method described in FIG. 4.
In some embodiments, when entering and setting the wakeup word text, the user may select and determine the pronunciations (that is, the syllable identifiers) of the polyphonic characters contained in the wakeup word text. In some embodiments, the pronunciations (that is, the syllable identifiers) of the polyphonic characters contained in the wakeup word text may also be determined by a default determination rule of the system.
For example, in this embodiment of this application, after the user enters the wakeup word text, the wakeup word text may first be detected and analyzed to determine whether it contains a polyphonic character. When a polyphonic character exists, polyphone processing is performed on the wakeup word text according to the system's default polyphone processing rule, or the syllable identifier of the polyphonic character is determined according to the user's selection, and the syllable combination sequence corresponding to the wakeup word text is then constructed. In this case, the target probability vector can be determined from the posterior probability vector according to the syllable combination sequence, and the confidence can be computed directly according to the target probability vector.
Suppose the posterior probability values corresponding to the syllable identifiers contained in the syllable combination sequence are obtained from the posterior probability vector, that is, {P_IDn1, P_IDn2, P_IDn3, P_IDn4, P_IDn5}. Then, a 4-dimensional target probability vector is obtained according to the processing method described in FIG. 4, where the number of posterior probability values contained in the target probability vector is the same as the number of characters in the wakeup word text.
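Continuing the sketches above, extracting the per-character posterior probability values by syllable ID might look as follows; a polyphonic character temporarily keeps all of its candidate scores, which are merged later as described for FIG. 4:

```python
import numpy as np

def extract_target_vector(posterior, syllable_sequence):
    """Pick, per character, the posterior(s) of its candidate syllable IDs.

    posterior: 1-D array over all syllable IDs (e.g. 1,500-dimensional).
    syllable_sequence: output of build_syllable_sequence() above.
    A polyphonic character keeps all candidate scores at this point.
    """
    return [(ch, posterior[ids]) for ch, ids in syllable_sequence]
```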
Then, the confidence is computed according to the target probability vector, and whether the confidence is greater than or equal to the set threshold is judged. If it is greater than or equal to the threshold, the speech data is considered to contain the wakeup word text; if it is less than the threshold, the speech data is considered not to contain the wakeup word text.
The confidence may be computed according to the following formula:

$$\text{confidence} = \Big(\prod_{i=1}^{n-1} \max_{h_{\max} \leq k \leq j} p'_{ik}\Big)^{\frac{1}{n-1}} \tag{1}$$
where n denotes the number of output units of the deep neural network model, p'_{ik} denotes the smoothed posterior probability of the k-th frame output by the i-th output unit, and h_max = max{1, j - w_max + 1} denotes the position of the first frame within the confidence computation window w_max. w_max may be determined by a settable number of frames; for example, w_max is set to 100 frames. During the confidence judgment, the threshold is adjustable so as to balance the final wakeup performance.
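A sketch of this computation under the stated assumptions follows; the 30-frame smoothing window is an assumed value, w_max = 100 follows the text, and because the target probability vector contains only the wakeup word's own output units, the geometric mean is taken over all of its columns:

```python
import numpy as np

def smooth(posteriors, w_smooth=30):
    """Moving average over the preceding w_smooth frames; shape (T, n)."""
    out = np.empty_like(posteriors)
    for j in range(posteriors.shape[0]):
        h = max(0, j - w_smooth + 1)
        out[j] = posteriors[h : j + 1].mean(axis=0)
    return out

def confidence(smoothed, j, w_max=100):
    """Formula (1): geometric mean over units of the windowed maximum.

    The target vector here holds only the wakeup word's own units (there
    is no filler unit), so the product runs over all of its columns.
    """
    h_max = max(0, j - w_max + 1)
    best = smoothed[h_max : j + 1].max(axis=0)  # per-unit max over window
    return float(np.prod(best) ** (1.0 / best.size))
```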
On the basis of the above embodiments, in this embodiment of this application, after the user enters the wakeup word text, the syllable combination sequence may also be constructed directly according to the wakeup word text, the target probability vector may be determined from the posterior probability vector according to the syllable combination sequence, and whether the wakeup word text contains a polyphonic character may be determined in the process of computing the confidence according to the target probability vector.
In some embodiments, computing the confidence according to the target probability vector may include:
performing probability processing on each posterior probability value contained in the target probability vector;
determining, according to the mapping relationship between the syllable identifiers contained in the syllable combination sequence and the characters contained in the wakeup word text, whether the wakeup word text contains a polyphonic character;
computing the confidence according to the probability-processed target probability vector when the wakeup word text does not contain a polyphonic character;
summing, when the wakeup word text contains a polyphonic character, the probability-processed target probability vector according to the correspondence of the polyphonic character; and
computing the confidence according to the summed target probability vector.
In the above steps, the probability processing step and the polyphone determination step may occur synchronously; alternatively, the probability processing step may be performed first and the polyphone determination step later, or the polyphone determination step may be performed first and the probability processing step later.
In this embodiment of this application, speech data is recognized by constructing a deep neural network model covering all syllables of the pronunciation dictionary; posterior probability values corresponding to the syllable identifiers of the wakeup word text are then extracted from the recognition result, according to the pre-entered wakeup word text, as a target probability vector; after the confidence is computed according to the target probability vector, a decision is made on the confidence to determine whether the speech data contains the content corresponding to the wakeup word text. With the above method, good recognition performance can be obtained for an arbitrary wakeup word without dedicated optimization, and the method has the advantages of low algorithm complexity and short response time.
On the basis of the above embodiments, this embodiment of this application further proposes a method for optimizing the confidence judgment step. Refer to FIG. 3, which is a schematic flowchart of step 105 according to an embodiment of this application. As shown in FIG. 3, this step may include:
Step 301: For each posterior probability value in the target probability vector, determine whether it is lower than its corresponding prior probability value; set the posterior probability value to 0 when it is lower than its corresponding prior probability value, and leave the posterior probability value unprocessed when it is not lower than its corresponding prior probability value.
Step 302: Divide each posterior probability value processed as above by its corresponding prior probability value, to obtain a processed target probability vector.
Step 303: Perform smoothing on the processed target probability vector.
Step 304: Compute the confidence according to the smoothed target probability vector.
In the above steps, the target probability vector is the set of posterior probability values, equal in number to the characters of the wakeup word text, obtained after the polyphone pronunciations are processed according to the default rule or the user's selection, for example, {P_IDn1, P_IDn2, P_IDn4, P_IDn5}.
The prior probability value of each syllable identifier can be obtained by statistical analysis of the training data set. For example, from the training data set used to train the deep neural network model, the prior probability distribution of all syllable output units can be obtained, where the prior probability characterizes the probability that the syllable identifier corresponding to a syllable output unit appears in the training data set.
After the deep neural network model outputs the posterior probability vector, the target probability vector needs to be extracted from the posterior probability vector according to the syllable combination sequence. Then, posterior filtering is performed on each posterior probability value in the target probability vector. Posterior filtering means comparing each extracted dimension of the posterior probability with its corresponding prior probability: if it is lower than the prior probability, the posterior probability is set to zero; if it is not lower than the prior probability, the posterior probability is left unprocessed. In the posterior probability vector output by the deep neural network model, syllable output units other than the current one may also obtain a small probability (especially when the current content is noise). In this embodiment of this application, the above posterior filtering can effectively reduce the impact of this part of the probability distribution on the wakeup performance, thereby optimizing the wakeup result.
For the target probability vector after posterior filtering, each posterior probability value it contains is then divided by its corresponding prior probability value to obtain a corrected posterior probability value. This is the prior processing step. The output posterior probability is usually correlated with the prior probability to some extent: for pronunciation syllables that appear frequently in the training data, the output posterior probability at prediction time tends to be larger, while for pronunciation syllables that appear rarely in the training data, the corresponding output posterior probability at prediction time is smaller. The embodiments provided in this application propose dividing each posterior probability by its prior probability and using the quotient as the posterior probability value of the pronunciation syllable, so as to improve the robustness of the system and effectively improve the performance for wakeup words whose pronunciations occur with low probability.
In the above embodiment, posterior filtering is performed on each posterior probability value in the target probability vector to reduce the impact of other syllable output units on the wakeup performance, and prior processing is performed on each posterior-filtered posterior probability value, which effectively optimizes the performance of wakeup detection and improves the accuracy of wakeup recognition.
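Under the stated assumptions, steps 301 and 302 reduce to two vectorized operations; a sketch:

```python
import numpy as np

def posterior_filter_and_prior_scale(target, priors):
    """Steps 301 and 302: zero out sub-prior posteriors, divide by priors.

    target: per-frame posteriors of the wakeup word's syllable IDs, (T, n).
    priors: training-set prior of each of those syllable IDs, shape (n,).
    """
    filtered = np.where(target < priors, 0.0, target)  # step 301
    return filtered / priors                           # step 302
```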
Further, this embodiment of this application provides another method for optimizing the confidence judgment step. Refer to FIG. 4, which is a schematic flowchart of step 105 according to another embodiment of this application. This step may include:
Step 401: Perform probability processing on each posterior probability value contained in the target probability vector.
Step 402: Determine, according to the mapping relationship between the syllable identifiers contained in the syllable combination sequence and the characters contained in the wakeup word text, whether the wakeup word text contains a polyphonic character.
Step 403: When the wakeup word text does not contain a polyphonic character, compute the confidence according to the probability-processed target probability vector.
Step 404: When the wakeup word text contains a polyphonic character, sum the probability-processed target probability vector according to the correspondence of the polyphonic character.
Step 405: Compute the confidence according to the summed target probability vector.
In the above method steps, step 401 is the same as the method described in steps 301 and 302. The difference from the method shown in FIG. 3 is that, for the target probability vector after posterior filtering and prior processing, when a polyphonic character is determined to exist, the posterior probability values corresponding to the polyphonic character need to be merged; that is, the posterior probability values corresponding one-to-one to the multiple syllable identifiers of the polyphonic character are summed, and the summation result is taken as the posterior probability value of the polyphonic character. Then, the target probability vector of the current frame processed as above is averaged with the multi-frame results within a preceding time window; that is, the processed target probability vector is smoothed to reduce interference caused by noise. Finally, the confidence is computed according to formula (1).
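A sketch of the merging in step 404, reusing the hypothetical syllable_sequence structure from the earlier sketches:

```python
import numpy as np

def merge_polyphones(processed, syllable_sequence):
    """Step 404: sum the candidate-pronunciation scores of each character.

    processed: per-frame scores of all candidate syllable IDs, shape
               (T, total_ids), columns ordered as in syllable_sequence.
    Returns one column per character, shape (T, n_chars).
    """
    merged, col = [], 0
    for _ch, ids in syllable_sequence:
        merged.append(processed[:, col : col + len(ids)].sum(axis=1))
        col += len(ids)
    return np.stack(merged, axis=1)

# For "你好开心" this turns the five columns {ID n1..n5} into four,
# summing the two pronunciations of "好" into a single score.
```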
When no polyphonic character exists, the method described in FIG. 4 is the same as the content described in FIG. 3.
In this embodiment of this application, whether the wakeup word text contains a polyphonic character can be determined according to the mapping relationship between the syllable identifiers contained in the syllable combination sequence and the characters contained in the wakeup word text. As shown in FIG. 6, the character "好" corresponds to two syllable identifiers, indicating that a polyphonic character exists in the wakeup word text. In some embodiments, the result of determining whether the wakeup word text contains a polyphonic character may also be conveyed by an indicator; for example, when a polyphonic character is determined to exist in the wakeup word text, the indicator is marked. After the polyphonic character is identified by the indicator, the confidence can be computed according to the method shown in FIG. 3, so as to determine whether the to-be-recognized speech data includes the wakeup word text.
Step 405 may further include:
performing smoothing on the summed target probability vector; and
computing the confidence according to the smoothed target probability vector.
In this embodiment of this application, polyphone recognition processing optimizes the performance of wakeup word detection and improves the accuracy of wakeup word detection.
For a better understanding of this application, suppose the user enters "你好开心" as the wakeup word text through the wakeup word text input interface shown in FIG. 5. Each character contained in the wakeup word text is converted into a syllable identifier by looking up the pronunciation dictionary, and the mapping relationship between the syllable identifiers and the characters contained in the wakeup word text is constructed, the mapping relationship serving as the syllable combination sequence, such as the syllable combination sequence shown in FIG. 6.
After the above operations are completed, the terminal device starts a detection program to detect speech data (which may also be called sound data) around the terminal device. After speech data input is detected, pre-emphasis is performed on the speech data, frame division is performed with a frame length of 25 ms and a frame shift of 10 ms to obtain multiple speech frames, a Hamming window is applied, and the speech feature corresponding to each speech frame of the speech data is extracted according to the FilterBank algorithm.
Then, the speech feature is input into the pre-constructed deep neural network model, and the posterior probability vector of the speech feature with respect to the syllable identifiers is output. Suppose the deep neural network model includes syllable output units equal in number to the syllables of the pre-constructed pronunciation dictionary, and the number of syllables of the pronunciation dictionary is assumed to be 1,500; then the posterior probability vector can be represented as {P_ID1, P_ID2, P_ID3, P_IDn1, P_IDn2, ..., P_IDm}, where m takes the value 1,500; for example, P_ID1 denotes the posterior probability value of the speech feature with respect to syllable identifier ID1.
The target probability vector is determined from the posterior probability vector according to the syllable combination sequence; it may be a set of posterior probability values screened according to the polyphone rule selected by the user, or a set of posterior probability values processed according to the system's default rule. For example, the target probability vector can be represented as {P_IDn1, P_IDn2, P_IDn4, P_IDn5}.
The confidence is computed according to the target probability vector, and when the confidence is greater than or equal to the threshold, the speech frame is determined to contain the wakeup word text.
During the above operations, the user can initiate an operation of changing the wakeup word text at any time, for example, changing the wakeup word text to "开机" (power on). With reference to the above method, "开机" is converted into syllable identifiers to obtain a syllable combination sequence. After speech data input is detected, the target probability vector is extracted, through the syllable combination sequence, from the posterior probability vector obtained by recognizing the speech data, and each posterior probability value in the target probability vector is processed in the manner of FIG. 3 or FIG. 4. The confidence is computed according to the processed target probability vector, and whether the speech data contains the wakeup word text is determined according to the confidence. The terminal device is woken up when the speech data is determined to contain the wakeup word text.
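Tying the foregoing sketches together, a per-utterance detection pass might read as follows. Every function name refers to the illustrative code above, the priors argument is assumed to be ordered like the flattened candidate IDs, and the threshold value is an assumption:

```python
import numpy as np

THRESHOLD = 0.5  # assumed value; the description only says it is adjustable

def detect(signal, syllable_sequence, priors):
    """One detection pass over an utterance, frame by frame."""
    feats = logfbank_features(signal)                        # (T, 40)
    posteriors = np.stack([syllable_posteriors(f) for f in feats])
    ids = [i for _ch, cand in syllable_sequence for i in cand]
    target = posteriors[:, ids]                              # (T, total_ids)
    target = posterior_filter_and_prior_scale(target, priors)
    target = merge_polyphones(target, syllable_sequence)     # (T, n_chars)
    smoothed = smooth(target)
    j = smoothed.shape[0] - 1                                # latest frame
    return confidence(smoothed, j) >= THRESHOLD
```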
The wakeup word detection method provided in this embodiment of this application has low computational complexity and can process the input frame by frame, so the method has a fast response speed.
It should be noted that although the operations of the method of this disclosure are described in a specific order in the accompanying drawings, this does not require or imply that these operations must be performed in that specific order, or that all of the illustrated operations must be performed, to achieve the desired results. On the contrary, the steps depicted in the flowcharts may change their order of execution. Additionally or alternatively, some steps may be omitted, multiple steps may be combined into one step for execution, and/or one step may be decomposed into multiple steps for execution.
Further refer to FIG. 7, which is a schematic structural diagram of a wakeup word detection apparatus 700 according to an embodiment of this application. The apparatus 700 includes:
a speech feature extraction unit 701, configured to acquire to-be-recognized speech data and extract a speech feature of each speech frame in the to-be-recognized speech data;
a speech feature recognition unit 702, configured to input the speech feature into a pre-constructed deep neural network model and output a posterior probability vector of the speech feature with respect to syllable identifiers, the deep neural network model including syllable output units equal in number to the syllables of a pre-constructed pronunciation dictionary; and
a confidence decision unit 703, configured to determine a target probability vector from the posterior probability vector according to the syllable combination sequence, compute a confidence according to the target probability vector, and determine, when the confidence is greater than or equal to a threshold, that the speech frame contains the wakeup word text, the syllable combination sequence being constructed according to the entered wakeup word text.
Further refer to FIG. 8, which is a schematic structural diagram of a confidence decision unit 703 according to an embodiment of this application. The confidence decision unit 703 further includes:
a posterior filtering subunit 801, configured to set a posterior probability value to 0 when each posterior probability value contained in the target probability vector is lower than its corresponding prior probability value, and otherwise leave the posterior probability value unprocessed;
a prior processing subunit 802, configured to divide each posterior probability value processed as above by its corresponding prior probability value, to obtain a processed target probability vector;
a first smoothing subunit 803, configured to perform smoothing on the processed target probability vector; and
a first confidence computation subunit 804, configured to compute the confidence according to the smoothed target probability vector.
Further refer to FIG. 9, which is a schematic structural diagram of a confidence decision unit 703 according to another embodiment of this application. The confidence decision unit 703 further includes:
a probability processing subunit 901, configured to perform probability processing on each posterior probability value contained in the target probability vector;
a polyphone determination subunit 902, configured to determine, according to the mapping relationship between the syllable identifiers contained in the syllable combination sequence and the characters contained in the wakeup word text, whether the wakeup word text contains a polyphonic character; and
a first confidence computation subunit 903, configured to compute the confidence according to the probability-processed target probability vector when the wakeup word text does not contain a polyphonic character.
The confidence decision unit 703 further includes:
a second confidence computation subunit 904, configured to sum, when the wakeup word text contains a polyphonic character, the probability-processed target probability vector according to the correspondence of the polyphonic character, and compute the confidence according to the summed target probability vector.
The probability processing subunit 901 may further include:
a posterior filtering module, configured to set a posterior probability value to 0 when each posterior probability value contained in the target probability vector is lower than its corresponding prior probability value, and otherwise leave the posterior probability value unprocessed; and
a prior processing module, configured to divide each posterior probability value processed as above by its corresponding prior probability value, to obtain a processed target probability vector.
The first confidence computation subunit 903 may further include:
a smoothing module, configured to perform smoothing on the probability-processed target probability vector; and
a confidence computation module, configured to compute the confidence according to the smoothed target probability vector.
The second confidence computation subunit 904 may further include:
a probability summation module, configured to sum the probability-processed target probability vector according to the correspondence of the polyphonic character;
a smoothing module, configured to perform smoothing on the summed target probability vector; and
a confidence computation module, configured to compute the confidence according to the smoothed target probability vector.
In some embodiments, the apparatus 700 may further include a network construction unit (not shown), configured to:
acquire a to-be-trained speech data set;
label each piece of speech data in the speech data set according to the syllables contained in the pronunciation dictionary, to obtain a training data set; and
train a deep neural network by using the training data set, to obtain the deep neural network model, where the input of the deep neural network model is the speech feature of each speech frame, and each syllable output unit outputs the posterior probability value of each speech feature with respect to the syllable identifier corresponding to that syllable output unit.
It is to be understood that the units or modules recorded in the apparatus 700 correspond to the steps in the method described with reference to FIG. 2. Therefore, the operations and features described above for the method are also applicable to the apparatus 700 and the units contained therein, and are not repeated here. The apparatus 700 may be implemented in advance in a browser or other security application of an electronic device, or may be loaded into the browser or security application of the electronic device by means of downloading or the like. Corresponding units in the apparatus 700 can cooperate with units in the electronic device to implement the solutions of the embodiments of this application.
As for the several modules or units mentioned in the detailed description above, such division is not mandatory. In fact, according to the implementations of this disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided and embodied by multiple modules or units.
On the basis of the above embodiments, an embodiment of this application further provides a wakeup word detection system. Refer to FIG. 10, which shows a wakeup word detection system 1000 according to an embodiment of this application. The system 1000 includes a speech recognition unit 1001 and a wakeup word setting unit 1002, where the speech recognition unit 1001 may be provided in a first terminal and the wakeup word setting unit 1002 may be provided in a second terminal. The first terminal and the second terminal may be connected in a wired or wireless manner. The first terminal may be, for example, a wireless speaker, and the second terminal may be, for example, a device such as a mobile phone or tablet.
The speech recognition unit 1001 may include the structure shown in FIG. 7: a speech feature extraction unit, configured to acquire to-be-recognized speech data and extract a speech feature of each speech frame in the to-be-recognized speech data;
a speech feature recognition unit, configured to input the speech feature into a pre-constructed deep neural network model and output a posterior probability vector of the speech feature with respect to syllable identifiers, the deep neural network model including syllable output units equal in number to the syllables of a pre-constructed pronunciation dictionary; and
a confidence decision unit, configured to determine a target probability vector from the posterior probability vector according to the syllable combination sequence, compute a confidence according to the target probability vector, and determine, when the confidence is greater than or equal to a threshold, that the speech frame contains the wakeup word text, the syllable combination sequence being constructed according to the entered wakeup word text.
The wakeup word setting unit 1002 is configured to acquire the entered wakeup word text, convert each character contained in the wakeup word text into a syllable identifier by looking up the pronunciation dictionary, and construct a mapping relationship between the syllable identifiers and the characters contained in the wakeup word text, the mapping relationship serving as the syllable combination sequence.
The above speech recognition unit and wakeup word setting unit provided in this application may also be implemented in the same terminal.
Refer now to FIG. 11, which is a schematic structural diagram of a terminal device or server 1100 suitable for implementing the embodiments of this application.
As shown in FIG. 11, the terminal device or server 1100 includes a central processing unit (CPU) 1101, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 1102 or a program loaded from a storage portion 1108 into a random access memory (RAM) 1103. The RAM 1103 also stores various programs and data required for the operation of the system 1100. The CPU 1101, the ROM 1102, and the RAM 1103 are connected to one another through a bus 1104. An input/output (I/O) interface 1105 is also connected to the bus 1104.
The following components are connected to the I/O interface 1105: an input portion 1106 including a keyboard, a mouse, and the like; an output portion 1107 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage portion 1108 including a hard disk and the like; and a communication portion 1109 including a network interface card such as a LAN card or a modem. The communication portion 1109 performs communication processing via a network such as the Internet. A driver 1110 is also connected to the I/O interface 1105 as needed. A removable medium 1111, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is installed on the driver 1110 as needed, so that the computer program read therefrom is installed into the storage portion 1108 as needed.
In particular, according to the embodiments of this disclosure, the process described above with reference to the flowchart of FIG. 2 may be implemented as a computer software program. For example, an embodiment of this disclosure includes a computer program product, which includes a computer program carried on a machine-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 1109, and/or installed from the removable medium 1111. When the computer program is executed by the central processing unit (CPU) 1101, the above functions defined in the system of this application are executed.
Note that the computer-readable medium shown in this disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this disclosure, the computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device. In this disclosure, the computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. The computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium; the computer-readable medium may send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted by any suitable medium, including but not limited to: wireless, wire, optical cable, RF, or the like, or any suitable combination of the above.
The flowcharts and block diagrams in the accompanying drawings illustrate possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a part of code, and the module, program segment, or part of code contains one or more executable instructions for realizing the specified logical function. Note also that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession may actually be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. Note also that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units or modules involved in the embodiments described in this application can be implemented in software or in hardware. The described units or modules may also be provided in a processor; for example, it may be described as: a processor including a preprocessing module, a receiving module, a selection generating module, and a sending module. The names of these units or modules do not, under certain circumstances, constitute a limitation on the units or modules themselves; for example, the preprocessing module may also be described as "a unit configured to pre-assign a virtual identifier, a first identifier, and at least one second identifier to the first client".
In another aspect, this application further provides a computer-readable storage medium. The computer-readable storage medium may be included in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable storage medium stores one or more programs that, when executed by one or more processors, perform the methods described in this application.
The above description is merely the preferred embodiments of this application and an explanation of the applied technical principles. Those skilled in the art should understand that the scope of disclosure involved in this application is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the foregoing disclosed concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) this application.

Claims (17)

  1. An artificial intelligence-based wakeup word detection method, executed by one or more computing devices, the method comprising:
    constructing at least one syllable combination sequence, by using a preset pronunciation dictionary, for a user-defined wakeup word text entered by a user through a user interface, the pronunciation dictionary containing pronunciations respectively corresponding to a plurality of text elements, and the syllable combination sequence being an ordered combination of a plurality of syllables corresponding to a plurality of text elements of the wakeup word text;
    acquiring to-be-recognized speech data, and extracting a speech feature of each speech frame in the to-be-recognized speech data;
    inputting the speech feature into a pre-constructed deep neural network model, and outputting a posterior probability vector of the speech feature with respect to syllable identifiers, the deep neural network model comprising syllable output units equal in number to the syllables of the pronunciation dictionary;
    determining a target probability vector from the posterior probability vector according to the syllable combination sequence, the target probability vector comprising posterior probability values, determined from the posterior probability vector, corresponding to the text elements in the wakeup word text; and
    computing a confidence according to the target probability vector, and determining, when the confidence is greater than or equal to a threshold, that the speech frame contains the wakeup word text.
  2. The artificial intelligence-based wakeup word detection method according to claim 1, wherein the computing a confidence according to the target probability vector comprises:
    performing probability processing on each posterior probability value contained in the target probability vector;
    determining, according to a mapping relationship between the syllable identifiers contained in the syllable combination sequence and characters contained in the wakeup word text, whether the wakeup word text contains a polyphonic character; and
    computing the confidence according to the probability-processed target probability vector when the wakeup word text does not contain a polyphonic character.
  3. The artificial intelligence-based wakeup word detection method according to claim 2, wherein the computing a confidence according to the target probability vector further comprises:
    summing, when the wakeup word text contains a polyphonic character, the probability-processed target probability vector according to the correspondence of the polyphonic character; and
    computing the confidence according to the summed target probability vector.
  4. The artificial intelligence-based wakeup word detection method according to claim 2 or 3, wherein the performing probability processing on each posterior probability value contained in the target probability vector comprises:
    setting the posterior probability value to 0 when the posterior probability value is lower than its corresponding prior probability value, and otherwise leaving the posterior probability value unprocessed; and
    dividing the posterior probability value processed as above by its corresponding prior probability value, to obtain a processed target probability vector.
  5. The artificial intelligence-based wakeup word detection method according to claim 2 or 3, wherein the computing the confidence according to the probability-processed target probability vector or according to the summed target probability vector comprises:
    performing smoothing on the probability-processed target probability vector or the summed target probability vector; and
    computing the confidence according to the smoothed target probability vector.
  6. The artificial intelligence-based wakeup word detection method according to claim 1, wherein constructing at least one syllable combination sequence comprises:
    acquiring the wakeup word text entered by the user;
    converting each character contained in the wakeup word text into the syllable identifier by looking up the pronunciation dictionary; and
    constructing a mapping relationship between the syllable identifiers and the characters contained in the wakeup word text, the mapping relationship serving as the syllable combination sequence.
  7. The artificial intelligence-based wakeup word detection method according to claim 1, further comprising:
    acquiring a to-be-trained speech data set;
    labeling each piece of speech data in the speech data set according to the syllables contained in the pronunciation dictionary, to obtain a training data set; and
    training a deep neural network by using the training data set, to obtain the deep neural network model, wherein the input of the deep neural network model is the speech feature of each speech frame, and each syllable output unit outputs the posterior probability value of each speech feature with respect to the syllable identifier corresponding to that syllable output unit.
  8. An artificial intelligence-based wakeup word detection apparatus, the apparatus comprising:
    a wakeup word setting unit, configured to construct at least one syllable combination sequence, by using a preset pronunciation dictionary, for a user-defined wakeup word text entered by a user through a user interface, the pronunciation dictionary containing pronunciations respectively corresponding to a plurality of text elements, and the syllable combination sequence being an ordered combination of a plurality of syllables corresponding to a plurality of text elements of the wakeup word text;
    a speech feature extraction unit, configured to acquire to-be-recognized speech data and extract a speech feature of each speech frame in the to-be-recognized speech data;
    a speech feature recognition unit, configured to input the speech feature into a pre-constructed deep neural network model and output a posterior probability vector of the speech feature with respect to syllable identifiers, the deep neural network model comprising syllable output units equal in number to the syllables of a pre-constructed pronunciation dictionary; and
    a confidence decision unit, configured to determine a target probability vector from the posterior probability vector according to the syllable combination sequence, the target probability vector comprising posterior probability values, determined from the posterior probability vector, corresponding to the text elements in the wakeup word text; compute a confidence according to the target probability vector; and determine, when the confidence is greater than or equal to a threshold, that the speech frame contains the wakeup word text.
  9. The artificial intelligence-based wakeup word detection apparatus according to claim 8, wherein the confidence decision unit comprises:
    a probability processing subunit, configured to perform probability processing on each posterior probability value contained in the target probability vector;
    a polyphone determination subunit, configured to determine, according to a mapping relationship between the syllable identifiers contained in the syllable combination sequence and characters contained in the wakeup word text, whether the wakeup word text contains a polyphonic character; and
    a first confidence computation subunit, configured to compute the confidence according to the probability-processed target probability vector when the wakeup word text does not contain a polyphonic character.
  10. The artificial intelligence-based wakeup word detection apparatus according to claim 9, wherein the confidence decision unit further comprises:
    a second confidence computation subunit, configured to sum, when the wakeup word text contains a polyphonic character, the probability-processed target probability vector according to the correspondence of the polyphonic character; and
    compute the confidence according to the summed target probability vector.
  11. The artificial intelligence-based wakeup word detection apparatus according to claim 9 or 10, wherein the probability processing subunit comprises:
    a posterior filtering module, configured to set the posterior probability value to 0 when the posterior probability value is lower than its corresponding prior probability value, and otherwise leave the posterior probability value unprocessed; and
    a prior processing module, configured to divide the posterior probability value processed as above by its corresponding prior probability value, to obtain a processed target probability vector.
  12. The artificial intelligence-based wakeup word detection apparatus according to claim 9 or 10, wherein the first confidence computation subunit comprises:
    a smoothing module, configured to perform smoothing on the probability-processed target probability vector or the summed target probability vector; and
    a confidence computation module, configured to compute the confidence according to the smoothed target probability vector.
  13. The artificial intelligence-based wakeup word detection apparatus according to claim 8, wherein the wakeup word setting unit is configured to:
    acquire the wakeup word text entered by the user;
    convert each character contained in the wakeup word text into the syllable identifier by looking up the pronunciation dictionary; and
    construct a mapping relationship between the syllable identifiers and the characters contained in the wakeup word text, the mapping relationship serving as the syllable combination sequence.
  14. The artificial intelligence-based wakeup word detection apparatus according to claim 8, further comprising a network construction unit, configured to:
    acquire a to-be-trained speech data set;
    label each piece of speech data in the speech data set according to the syllables contained in the pronunciation dictionary, to obtain a training data set; and
    train a deep neural network by using the training data set, to obtain the deep neural network model, wherein the input of the deep neural network model is the speech feature of each speech frame, and each syllable output unit outputs the posterior probability value of each speech feature with respect to the syllable identifier corresponding to that syllable output unit.
  15. An artificial intelligence-based wakeup word detection system, comprising:
    a first terminal device, configured to construct at least one syllable combination sequence, by using a preset pronunciation dictionary, for a user-defined wakeup word text entered by a user through a user interface, and provide the at least one syllable combination sequence to a second terminal device, the pronunciation dictionary containing pronunciations respectively corresponding to a plurality of text elements, and the syllable combination sequence being an ordered combination of a plurality of syllables corresponding to a plurality of text elements of the wakeup word text; and
    a second terminal device, configured to acquire to-be-recognized speech data and extract a speech feature of each speech frame in the to-be-recognized speech data;
    input the speech feature into a pre-constructed deep neural network model, and output a posterior probability vector of the speech feature with respect to syllable identifiers, the deep neural network model comprising syllable output units equal in number to the syllables of the pronunciation dictionary;
    determine a target probability vector from the posterior probability vector according to the syllable combination sequence, the target probability vector comprising posterior probability values, determined from the posterior probability vector, corresponding to the text elements in the wakeup word text; and
    compute a confidence according to the target probability vector, and determine, when the confidence is greater than or equal to a threshold, that the speech frame contains the wakeup word text.
  16. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method according to any one of claims 1 to 7.
  17. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 7.
PCT/CN2020/115800 2019-11-14 2020-09-17 Artificial intelligence-based wakeup word detection method, apparatus, device, and medium WO2021093449A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/483,617 US11848008B2 (en) 2019-11-14 2021-09-23 Artificial intelligence-based wakeup word detection method and apparatus, device, and medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911124453.4 2019-11-14
CN201911124453.4A CN110838289B (zh) 2019-11-14 2019-11-14 Artificial intelligence-based wakeup word detection method, apparatus, device, and medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/483,617 Continuation US11848008B2 (en) 2019-11-14 2021-09-23 Artificial intelligence-based wakeup word detection method and apparatus, device, and medium

Publications (1)

Publication Number Publication Date
WO2021093449A1 true WO2021093449A1 (zh) 2021-05-20

Family

ID=69576497

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/115800 WO2021093449A1 (zh) 2019-11-14 2020-09-17 基于人工智能的唤醒词检测方法、装置、设备及介质

Country Status (3)

Country Link
US (1) US11848008B2 (zh)
CN (1) CN110838289B (zh)
WO (1) WO2021093449A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113254684A (zh) * 2021-06-18 2021-08-13 腾讯科技(深圳)有限公司 一种内容时效的确定方法、相关装置、设备以及存储介质
CN113450779A (zh) * 2021-06-23 2021-09-28 海信视像科技股份有限公司 语音模型训练数据集构建方法及装置
CN115132198A (zh) * 2022-05-27 2022-09-30 腾讯科技(深圳)有限公司 数据处理方法、装置、电子设备、程序产品及介质
CN115132195A (zh) * 2022-05-12 2022-09-30 腾讯科技(深圳)有限公司 语音唤醒方法、装置、设备、存储介质及程序产品
CN115132205A (zh) * 2022-06-27 2022-09-30 杭州网易智企科技有限公司 关键词检测方法、装置、设备及存储介质
CN116705058A (zh) * 2023-08-04 2023-09-05 贝壳找房(北京)科技有限公司 多模语音任务的处理方法、电子设备及可读存储介质

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110838289B (zh) * 2019-11-14 2023-08-11 腾讯科技(深圳)有限公司 基于人工智能的唤醒词检测方法、装置、设备及介质
CN111210830B (zh) * 2020-04-20 2020-08-11 深圳市友杰智新科技有限公司 基于拼音的语音唤醒方法、装置和计算机设备
CN111833867B (zh) * 2020-06-08 2023-12-05 北京嘀嘀无限科技发展有限公司 语音指令识别方法、装置、可读存储介质和电子设备
CN111681661B (zh) * 2020-06-08 2023-08-08 北京有竹居网络技术有限公司 语音识别的方法、装置、电子设备和计算机可读介质
CN111739521B (zh) * 2020-06-19 2021-06-22 腾讯科技(深圳)有限公司 电子设备唤醒方法、装置、电子设备及存储介质
CN112071308A (zh) * 2020-09-11 2020-12-11 中山大学 一种基于语音合成数据增强的唤醒词训练方法
KR20220037846A (ko) * 2020-09-18 2022-03-25 삼성전자주식회사 음성 인식을 수행하기 위한 전자 장치를 식별하기 위한 전자 장치 및 그 동작 방법
CN112599127B (zh) * 2020-12-04 2022-12-30 腾讯科技(深圳)有限公司 一种语音指令处理方法、装置、设备及存储介质
CN112802452A (zh) * 2020-12-21 2021-05-14 出门问问(武汉)信息科技有限公司 垃圾指令识别方法及装置
EP4099319A4 (en) * 2020-12-28 2023-11-15 Beijing Baidu Netcom Science Technology Co., Ltd. WAKE INDEX MONITORING METHOD AND APPARATUS AND ELECTRONIC DEVICE
CN112767935B (zh) * 2020-12-28 2022-11-25 北京百度网讯科技有限公司 唤醒指标监测方法、装置及电子设备
CN113096650B (zh) * 2021-03-03 2023-12-08 河海大学 一种基于先验概率的声学解码方法
CN113555016A (zh) * 2021-06-24 2021-10-26 北京房江湖科技有限公司 语音交互方法、电子设备及可读存储介质
CN113470646B (zh) * 2021-06-30 2023-10-20 北京有竹居网络技术有限公司 一种语音唤醒方法、装置及设备
CN113327610B (zh) * 2021-06-30 2023-10-13 北京有竹居网络技术有限公司 一种语音唤醒方法、装置及设备
CN113450800B (zh) * 2021-07-05 2024-06-21 上海汽车集团股份有限公司 一种唤醒词激活概率的确定方法、装置和智能语音产品
CN113539266A (zh) * 2021-07-13 2021-10-22 盛景智能科技(嘉兴)有限公司 命令词识别方法、装置、电子设备和存储介质
CN113658593B (zh) * 2021-08-14 2024-03-12 普强时代(珠海横琴)信息技术有限公司 基于语音识别的唤醒实现方法及装置
CN113724710A (zh) * 2021-10-19 2021-11-30 广东优碧胜科技有限公司 语音识别方法及装置、电子设备、计算机可读存储介质
CN114299923B (zh) * 2021-12-24 2024-10-22 北京声智科技有限公司 音频识别方法、装置、电子设备及存储介质
CN114360510A (zh) * 2022-01-14 2022-04-15 腾讯科技(深圳)有限公司 一种语音识别方法和相关装置
CN114913853A (zh) * 2022-06-21 2022-08-16 北京有竹居网络技术有限公司 语音唤醒方法、装置、存储介质和电子设备
CN115223574B (zh) * 2022-07-15 2023-11-24 北京百度网讯科技有限公司 语音信息处理方法、模型的训练方法、唤醒方法及装置

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106098059A (zh) * 2016-06-23 2016-11-09 上海交通大学 可定制语音唤醒方法及系统
CN107123417A (zh) * 2017-05-16 2017-09-01 上海交通大学 基于鉴别性训练的定制语音唤醒优化方法及系统
CN108182937A (zh) * 2018-01-17 2018-06-19 出门问问信息科技有限公司 关键词识别方法、装置、设备及存储介质
CN108615526A (zh) * 2018-05-08 2018-10-02 腾讯科技(深圳)有限公司 语音信号中关键词的检测方法、装置、终端及存储介质
CN109036412A (zh) * 2018-09-17 2018-12-18 苏州奇梦者网络科技有限公司 语音唤醒方法和系统
CN109065044A (zh) * 2018-08-30 2018-12-21 出门问问信息科技有限公司 唤醒词识别方法、装置、电子设备及计算机可读存储介质
US10468027B1 (en) * 2016-11-11 2019-11-05 Amazon Technologies, Inc. Connected accessory for a voice-controlled device
CN110838289A (zh) * 2019-11-14 2020-02-25 腾讯科技(深圳)有限公司 基于人工智能的唤醒词检测方法、装置、设备及介质

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9117449B2 (en) * 2012-04-26 2015-08-25 Nuance Communications, Inc. Embedded system for construction of small footprint speech recognition with user-definable constraints
US10719115B2 (en) * 2014-12-30 2020-07-21 Avago Technologies International Sales Pte. Limited Isolated word training and detection using generated phoneme concatenation models of audio inputs
TWI525532B (zh) * 2015-03-30 2016-03-11 Yu-Wei Chen Set the name of the person to wake up the name for voice manipulation
US10074363B2 (en) * 2015-11-11 2018-09-11 Apptek, Inc. Method and apparatus for keyword speech recognition
CN105632486B (zh) * 2015-12-23 2019-12-17 北京奇虎科技有限公司 一种智能硬件的语音唤醒方法和装置
CN106611597B (zh) * 2016-12-02 2019-11-08 百度在线网络技术(北京)有限公司 基于人工智能的语音唤醒方法和装置
CN108281137A (zh) * 2017-01-03 2018-07-13 中国科学院声学研究所 一种全音素框架下的通用语音唤醒识别方法及系统
CN107221326B (zh) * 2017-05-16 2021-05-28 百度在线网络技术(北京)有限公司 基于人工智能的语音唤醒方法、装置和计算机设备
CN107134279B (zh) * 2017-06-30 2020-06-19 百度在线网络技术(北京)有限公司 一种语音唤醒方法、装置、终端和存储介质
CN108122556B (zh) * 2017-08-08 2021-09-24 大众问问(北京)信息科技有限公司 减少驾驶人语音唤醒指令词误触发的方法及装置
CN109243428B (zh) * 2018-10-15 2019-11-26 百度在线网络技术(北京)有限公司 一种建立语音识别模型的方法、语音识别方法及系统
CN110444210B (zh) * 2018-10-25 2022-02-08 腾讯科技(深圳)有限公司 一种语音识别的方法、唤醒词检测的方法及装置
CN109767763B (zh) * 2018-12-25 2021-01-26 苏州思必驰信息科技有限公司 自定义唤醒词的确定方法和用于确定自定义唤醒词的装置
CN110033758B (zh) * 2019-04-24 2021-09-24 武汉水象电子科技有限公司 一种基于小训练集优化解码网络的语音唤醒实现方法
US11222622B2 (en) * 2019-05-05 2022-01-11 Microsoft Technology Licensing, Llc Wake word selection assistance architectures and methods
WO2020231181A1 (en) * 2019-05-16 2020-11-19 Samsung Electronics Co., Ltd. Method and device for providing voice recognition service
US11282500B2 (en) * 2019-07-19 2022-03-22 Cisco Technology, Inc. Generating and training new wake words
CN110364143B (zh) * 2019-08-14 2022-01-28 腾讯科技(深圳)有限公司 语音唤醒方法、装置及其智能电子设备
US20210050003A1 (en) * 2019-08-15 2021-02-18 Sameer Syed Zaheer Custom Wake Phrase Training
US11217245B2 (en) * 2019-08-29 2022-01-04 Sony Interactive Entertainment Inc. Customizable keyword spotting system with keyword adaptation

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106098059A (zh) * 2016-06-23 2016-11-09 上海交通大学 可定制语音唤醒方法及系统
US10468027B1 (en) * 2016-11-11 2019-11-05 Amazon Technologies, Inc. Connected accessory for a voice-controlled device
CN107123417A (zh) * 2017-05-16 2017-09-01 上海交通大学 基于鉴别性训练的定制语音唤醒优化方法及系统
CN108182937A (zh) * 2018-01-17 2018-06-19 出门问问信息科技有限公司 关键词识别方法、装置、设备及存储介质
CN108615526A (zh) * 2018-05-08 2018-10-02 腾讯科技(深圳)有限公司 语音信号中关键词的检测方法、装置、终端及存储介质
CN109065044A (zh) * 2018-08-30 2018-12-21 出门问问信息科技有限公司 唤醒词识别方法、装置、电子设备及计算机可读存储介质
CN109036412A (zh) * 2018-09-17 2018-12-18 苏州奇梦者网络科技有限公司 语音唤醒方法和系统
CN110838289A (zh) * 2019-11-14 2020-02-25 腾讯科技(深圳)有限公司 基于人工智能的唤醒词检测方法、装置、设备及介质

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113254684A (zh) * 2021-06-18 2021-08-13 腾讯科技(深圳)有限公司 一种内容时效的确定方法、相关装置、设备以及存储介质
CN113450779A (zh) * 2021-06-23 2021-09-28 海信视像科技股份有限公司 语音模型训练数据集构建方法及装置
CN113450779B (zh) * 2021-06-23 2022-11-11 海信视像科技股份有限公司 语音模型训练数据集构建方法及装置
CN115132195A (zh) * 2022-05-12 2022-09-30 腾讯科技(深圳)有限公司 语音唤醒方法、装置、设备、存储介质及程序产品
CN115132195B (zh) * 2022-05-12 2024-03-12 腾讯科技(深圳)有限公司 语音唤醒方法、装置、设备、存储介质及程序产品
CN115132198A (zh) * 2022-05-27 2022-09-30 腾讯科技(深圳)有限公司 数据处理方法、装置、电子设备、程序产品及介质
CN115132198B (zh) * 2022-05-27 2024-03-15 腾讯科技(深圳)有限公司 数据处理方法、装置、电子设备、程序产品及介质
CN115132205A (zh) * 2022-06-27 2022-09-30 杭州网易智企科技有限公司 关键词检测方法、装置、设备及存储介质
CN116705058A (zh) * 2023-08-04 2023-09-05 贝壳找房(北京)科技有限公司 多模语音任务的处理方法、电子设备及可读存储介质
CN116705058B (zh) * 2023-08-04 2023-10-27 贝壳找房(北京)科技有限公司 多模语音任务的处理方法、电子设备及可读存储介质

Also Published As

Publication number Publication date
CN110838289A (zh) 2020-02-25
US11848008B2 (en) 2023-12-19
US20220013111A1 (en) 2022-01-13
CN110838289B (zh) 2023-08-11

Similar Documents

Publication Publication Date Title
WO2021093449A1 (zh) 基于人工智能的唤醒词检测方法、装置、设备及介质
US11915699B2 (en) Account association with device
US10515627B2 (en) Method and apparatus of building acoustic feature extracting model, and acoustic feature extracting method and apparatus
WO2020182153A1 (zh) 基于自适应语种进行语音识别的方法及相关装置
CN112562691B (zh) 一种声纹识别的方法、装置、计算机设备及存储介质
CN110310623B (zh) 样本生成方法、模型训练方法、装置、介质及电子设备
WO2021051544A1 (zh) 语音识别方法及其装置
CN111312245B (zh) 一种语音应答方法、装置和存储介质
EP3113176B1 (en) Speech recognition
WO2018227781A1 (zh) 语音识别方法、装置、计算机设备及存储介质
WO2020238045A1 (zh) 智能语音识别方法、装置及计算机可读存储介质
JP2019035936A (ja) ニューラルネットワークを用いた認識方法及び装置並びにトレーニング方法及び電子装置
US11398219B2 (en) Speech synthesizer using artificial intelligence and method of operating the same
CN111833845A (zh) 多语种语音识别模型训练方法、装置、设备及存储介质
US20230127787A1 (en) Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium
US11200888B2 (en) Artificial intelligence device for providing speech recognition function and method of operating artificial intelligence device
CN112151015A (zh) 关键词检测方法、装置、电子设备以及存储介质
CN110706707B (zh) 用于语音交互的方法、装置、设备和计算机可读存储介质
US20210327407A1 (en) Speech synthesizer using artificial intelligence, method of operating speech synthesizer and computer-readable recording medium
US20240013784A1 (en) Speaker recognition adaptation
JP6875819B2 (ja) 音響モデル入力データの正規化装置及び方法と、音声認識装置
Benelli et al. A low power keyword spotting algorithm for memory constrained embedded systems
JP7178394B2 (ja) 音声信号を処理するための方法、装置、機器、および媒体
CN113129867A (zh) 语音识别模型的训练方法、语音识别方法、装置和设备
CN113611316A (zh) 人机交互方法、装置、设备以及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20887824

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20887824

Country of ref document: EP

Kind code of ref document: A1