
WO2017202016A1 - 语音唤醒方法和装置 (Voice wake-up method and apparatus) - Google Patents

语音唤醒方法和装置 (Voice wake-up method and apparatus)

Info

Publication number
WO2017202016A1
WO2017202016A1 · PCT/CN2016/111367 · CN2016111367W
Authority
WO
WIPO (PCT)
Prior art keywords
wake
words
inverse model
word
decoding
Prior art date
Application number
PCT/CN2016/111367
Other languages
English (en)
French (fr)
Inventor
袁斌
Original Assignee
百度在线网络技术(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 百度在线网络技术(北京)有限公司 filed Critical 百度在线网络技术(北京)有限公司
Priority to US16/099,943 priority Critical patent/US10867602B2/en
Publication of WO2017202016A1 publication Critical patent/WO2017202016A1/zh

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L 21/0232 Processing in the frequency domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/78 Detection of presence or absence of voice signals
    • G10L 25/84 Detection of presence or absence of voice signals for discriminating voice from noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 2015/088 Word spotting
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/225 Feedback of the input speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0316 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude

Definitions

  • the present application relates to the field of voice recognition technology, and in particular, to a voice wake-up method and apparatus.
  • Voice wake-up technology is a feature that acts as a switch-like entry point.
  • By waking the device up with speech, the user can initiate human-computer interaction; that is, the machine recognizes the user's subsequent voice commands only after it has been woken up by the wake-up word spoken by the user.
  • the present application aims to solve at least one of the technical problems in the related art to some extent.
  • an object of the present application is to propose a voice wake-up method.
  • Another object of the present application is to propose a voice wake-up apparatus.
  • The voice wake-up method of the first aspect of the present application includes: acquiring a speech signal to be processed; decoding the speech signal according to a pre-generated search space to obtain a speech recognition result, where the search space includes the path of an inverse model, the inverse model includes a first inverse model, and the first inverse model is trained on the segmentation result of the wake-up word; when the first preset number of characters of the speech recognition result are obtained, judging whether the preset number of characters contain at least some of the characters of the wake-up word; and if not, directly determining not to wake up and ending the decoding of the speech signal.
  • The voice wake-up apparatus of the second aspect of the present application includes: an acquiring module configured to acquire a speech signal to be processed; a decoding module configured to decode the speech signal according to a pre-generated search space to obtain a speech recognition result, where the search space includes the path of an inverse model, the inverse model includes a first inverse model, and the first inverse model is trained on the segmentation result of the wake-up word; a determining module configured to, when the first preset number of characters of the speech recognition result are obtained, determine whether the preset number of characters contain at least some of the characters of the wake-up word; and a first processing module configured to, if not, directly determine not to wake up and end the decoding of the speech signal.
  • the false wake-up rate can be reduced.
  • Anti-noise capability can be improved by performing audio processing on the speech signal.
  • The scale of the inverse model can be reduced, so that it can be deployed locally on the terminal, solving the problem of requiring a constant network connection.
  • wake-up by any wake-up word can be achieved.
  • the wake-up sensitivity can be improved by weighting the path of the wake-up word.
  • Power consumption can be reduced by directly ending the search for the abnormal path at the time of decoding.
  • Wake-up can still succeed when the user embeds the wake-up word in a longer sentence, improving wake-up accuracy.
  • An embodiment of the present application further provides a terminal, including: a processor; and a memory for storing instructions executable by the processor. The processor is configured to: acquire a speech signal to be processed; decode the speech signal according to a pre-generated search space to obtain a speech recognition result, where the search space includes the path of an inverse model, the inverse model includes a first inverse model, and the first inverse model is trained on the segmentation result of the wake-up word; when the first preset number of characters of the speech recognition result are obtained, determine whether the preset number of characters contain at least some of the characters of the wake-up word; and if not, directly determine not to wake up and end the decoding of the speech signal.
  • An embodiment of the present application further provides a non-transitory computer-readable storage medium. When instructions in the storage medium are executed by a processor of a terminal, the terminal is enabled to perform a method including: acquiring a speech signal to be processed; decoding the speech signal according to a pre-generated search space to obtain a speech recognition result, where the search space includes the path of an inverse model, the inverse model includes a first inverse model, and the first inverse model is trained on the segmentation result of the wake-up word; when the first preset number of characters of the speech recognition result are obtained, judging whether the preset number of characters contain at least some of the characters of the wake-up word; and if not, directly determining not to wake up and ending the decoding of the speech signal.
  • An embodiment of the present application further provides a computer program product. When instructions in the computer program product are executed by a processor, a method is performed that includes: acquiring a speech signal to be processed; decoding the speech signal according to a pre-generated search space to obtain a speech recognition result, where the search space includes the path of an inverse model, the inverse model includes a first inverse model, and the first inverse model is trained on the segmentation result of the wake-up word; when the first preset number of characters of the speech recognition result are obtained, judging whether the preset number of characters contain at least some of the characters of the wake-up word; and if not, directly determining not to wake up and ending the decoding of the speech signal.
  • FIG. 1 is a schematic flowchart of a voice wake-up method according to an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a voice wake-up method according to another embodiment of the present application.
  • FIG. 3 is a schematic diagram of a search space in an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a voice wake-up device according to an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a voice wake-up device according to an embodiment of the present application.
  • Audio processing, such as noise reduction and signal enhancement, is performed to solve the problem of poor noise resistance.
  • Weighting the path of the wake-up word makes the wake-up word more likely to follow that path, improving wake-up sensitivity.
  • Each embodiment of the present application is not required to solve all the technical problems perfectly, but solves at least one technical problem at least to some extent.
  • the voice wake-up technology of the present application can be specifically applied to an offline scenario, that is, a local application in a terminal.
  • the voice wake-up technology of the present application can also be applied to the server to implement online voice wake-up.
  • The terminal involved in the present application may be any terminal to which voice wake-up technology can be applied, such as a mobile terminal, an in-vehicle terminal, an airborne terminal, or a desktop computer.
  • FIG. 1 is a schematic flowchart of a voice wake-up method according to an embodiment of the present application.
  • This embodiment can solve the problem of high false wakeup rate and high power consumption at least to some extent.
  • the process of this embodiment includes:
  • the initial speech signal is entered by the user.
  • the voice signal input by the user may be subjected to audio processing to obtain a voice signal to be processed.
  • audio processing please refer to the subsequent description.
  • S12: Decode the speech signal according to a pre-generated search space to obtain a speech recognition result, where the search space includes the path of an inverse model, the inverse model includes a first inverse model, and the first inverse model is trained on the segmentation result of the wake-up word.
  • The speech signal may undergo feature extraction to obtain acoustic features; the acoustic features are then used to search for the optimal path in the search space, and the text corresponding to the optimal path is determined as the speech recognition result.
  • The search space includes multiple paths, which may include the path of the wake-up word and the path of the inverse model, where the inverse model is used to guide non-wake-up words onto the inverse-model path during speech decoding.
  • One inverse model is called the first inverse model, which is generated by training on the segmentation result of the wake-up word. For example, if the wake-up word is "百度一下" (the wake-up word is configurable), it can be segmented first.
  • The segmentation rule may be to take the first character and, when there are more than three characters, to also split the word into two-character chunks; the segmentation result is then "百", "百度" and "一下", and these partial units can serve as training data for the inverse model to obtain the first inverse model.
  • The preset number of characters is, for example, three.
  • When the first three characters corresponding to the speech signal are obtained, it can be judged whether they contain at least part of the wake-up word. For example, if the wake-up word is "百度一下", it is judged whether the first three characters are "百度一", whether the last two of the first three characters are "百度", or whether the last of the first three characters is "百".
  • When it is determined not to wake up, the decoding of the speech signal is also ended directly; since the speech signal is a stretch of signal, there may be one or more further characters beyond the first three.
  • Once the first three characters have been recognized and it has been determined from them not to wake up, there is no need to continue recognizing the subsequent characters; recognition of the speech signal is ended directly, so power consumption can be reduced.
  • the false wake-up rate can be reduced.
  • Directly determining not to wake up and ending speech decoding reduces power consumption.
  • FIG. 2 is a schematic flowchart of a voice wake-up method according to another embodiment of the present application.
  • The problems of a high false wake-up rate, high power consumption, the need for a constant network connection, poor noise resistance, support for only a single wake-up word, and poor wake-up sensitivity can be solved at least to some extent.
  • This embodiment provides a technical solution addressing a relatively comprehensive set of problems.
  • The application is not limited to the solution of this embodiment; technical features that solve different technical problems may form separate technical solutions, or different technical features may be combined in any other manner to obtain new technical solutions.
  • the process of this embodiment includes:
  • S201: Construct the graph to generate the search space.
  • the search space may include multiple paths, including the path 31 where the wake-up word is located and the path 32 where the inverse model is located.
  • the search space further includes a path 33 in which the wake-up word and the anti-model are connected in series.
  • After the wake-up word, the silence (SIL) state may be entered directly, or the SIL state may be entered after passing through the inverse model.
  • the path of the wake-up word may be weighted, that is, the weight of the path of the wake-up word is increased on the original basis, so that the wake-up word is more easily entered into the path of the wake-up word.
  • the wake-up words can be set to one or more.
  • the inverse model may include a first inverse model and a second inverse model, and may be formed by parallel connection of the first inverse model and the second inverse model, or weighted parallel connection.
  • the first inverse model is generated after training the word segmentation result of the wake-up word.
  • the second inverse model is not directly trained on the corpus, but is generated after training the clustering result of the corpus to reduce the scale of the second inverse model, and is more suitable for the terminal local.
  • Some commonly used speech corpora may be used to cluster the syllables of pronunciations, for example into 26 clusters corresponding to 26 units; a second inverse model is then generated from these 26 units, and this second inverse model is a streamlined model.
  • search space can be pre-generated before the voice wakes up.
  • S202 Receive a voice signal input by a user.
  • the user said a word to the terminal.
  • Some initialization process can be performed before receiving the voice signal. For example, you can set wake-up words, generate search spaces, and initialize the audio processing module.
  • S203 Perform audio processing on the voice signal.
  • the audio processing in this embodiment may specifically include: noise reduction and voice enhancement processing.
  • noise reduction can be divided into noise reduction for low frequency noise and noise reduction for non-low frequency noise.
  • noises such as air conditioners and vehicle engines are low-frequency noise, and high-pass filtering techniques can be used to eliminate low-frequency noise.
  • Noise Suppression (NS) technology can be used to eliminate non-low frequency noise.
  • Automatic gain control (AGC) can be used to boost the energy of audio whose volume is too low to a level at which recognition is possible.
  • the voice signal to be processed can be obtained by VAD.
  • S205 Decode the speech signal to be processed according to the search space to obtain a speech recognition result.
  • the acoustic features can be extracted from the speech signal, and then the acoustic features are searched in the search space to obtain the optimal path as the speech recognition result.
  • the search algorithm may be a viterbi search algorithm.
  • If an abnormal path is found, the search of that path may be ended directly, thereby narrowing the search range, improving search efficiency, and reducing power consumption.
  • Taking a hidden Markov model (HMM) acoustic model as an example, if the difference between the acoustic-model scores of adjacent states obtained while searching along a path is greater than a preset value, that path can be determined to be an abnormal path.
  • The VAD can be reset immediately after a wake-up word is detected, and the wake-up word detection process restarted, so as to avoid the situation where only one wake-up word can be hit within a single VAD stretch.
  • The main function of resource release is to free the memory occupied by resources loaded during initialization, complete the reset of the wake-up module, and clear historical cached data.
  • the false wake-up rate can be reduced.
  • Directly determining not to wake up and ending speech decoding reduces power consumption.
  • Anti-noise capability can be improved by performing audio processing on the speech signal.
  • the scale of the inverse model can be reduced, so that it can be applied locally to the terminal to solve the problem of requiring full network connection.
  • Weighting the path of the wake-up word improves wake-up sensitivity. Directly ending the search of an abnormal path during decoding reduces power consumption. By including in the search space a path in which the inverse model and the wake-up word are connected in series, wake-up can still succeed when the user embeds the wake-up word in a longer sentence, improving wake-up accuracy.
  • FIG. 4 is a schematic structural diagram of a voice wake-up device according to an embodiment of the present application.
  • the apparatus 40 includes an acquisition module 41, a decoding module 42, a determination module 43, and a first processing module 44.
  • An obtaining module 41 configured to acquire a voice signal to be processed
  • The decoding module 42 is configured to decode the speech signal according to a pre-generated search space to obtain a speech recognition result, where the search space includes the path of an inverse model, the inverse model includes a first inverse model, and the first inverse model is trained on the segmentation result of the wake-up word.
  • The determining module 43 is configured to, when the first preset number of characters of the speech recognition result are obtained, determine whether the preset number of characters contain at least some of the characters of the wake-up word.
  • the first processing module 44 is configured to directly determine not to wake up if not included, and end decoding of the voice signal.
  • the obtaining module 41 includes:
  • a receiving submodule 411 configured to receive a voice signal input by a user
  • An audio processing sub-module 412 configured to perform audio processing on the voice signal
  • the endpoint detection sub-module 413 is configured to perform VAD on the audio signal after the audio processing to obtain a voice signal to be processed.
  • the audio processing sub-module 412 is specifically configured to:
  • the speech signal is AGC to enhance the strength of the speech signal.
  • the apparatus 40 further includes:
  • The second processing module 45 is configured to: if the preset number of characters contain at least some of the characters of the wake-up word, continue speech decoding to obtain the entire speech recognition result corresponding to the speech signal; and if the entire speech recognition result contains the wake-up word, perform the wake-up operation.
  • the inverse model further includes: a second inverse model, the second inverse model being trained to generate based on the clustering result of the corpus.
  • The search space further includes the path of the wake-up word, and the weight of that path has been increased by weighting.
  • the search space further includes: a path in which the inverse model and the wake-up word are concatenated.
  • the decoding module 42 is specifically configured to: when decoding, if an abnormal path is found, directly end the search for the path.
  • the wake-up words are multiple.
  • the false wake-up rate can be reduced.
  • Directly determining not to wake up and ending speech decoding reduces power consumption.
  • Anti-noise capability can be improved by performing audio processing on the speech signal.
  • the scale of the inverse model can be reduced, so that it can be applied locally to the terminal to solve the problem of requiring full network connection.
  • Weighting the path of the wake-up word improves wake-up sensitivity. Directly ending the search of an abnormal path during decoding reduces power consumption. By including in the search space a path in which the inverse model and the wake-up word are connected in series, wake-up can still succeed when the user embeds the wake-up word in a longer sentence, improving wake-up accuracy.
  • An embodiment of the present application further provides a terminal, including: a processor; and a memory for storing instructions executable by the processor. The processor is configured to: acquire a speech signal to be processed; decode the speech signal according to a pre-generated search space to obtain a speech recognition result, where the search space includes the path of an inverse model, the inverse model includes a first inverse model, and the first inverse model is trained on the segmentation result of the wake-up word; when the first preset number of characters of the speech recognition result are obtained, determine whether the preset number of characters contain at least some of the characters of the wake-up word; and if not, directly determine not to wake up and end the decoding of the speech signal.
  • An embodiment of the present application further provides a non-transitory computer-readable storage medium. When instructions in the storage medium are executed by a processor of a terminal, the terminal is enabled to perform a method including: acquiring a speech signal to be processed; decoding the speech signal according to a pre-generated search space to obtain a speech recognition result, where the search space includes the path of an inverse model, the inverse model includes a first inverse model, and the first inverse model is trained on the segmentation result of the wake-up word; when the first preset number of characters of the speech recognition result are obtained, judging whether the preset number of characters contain at least some of the characters of the wake-up word; and if not, directly determining not to wake up and ending the decoding of the speech signal.
  • An embodiment of the present application further provides a computer program product. When instructions in the computer program product are executed by a processor, a method is performed that includes: acquiring a speech signal to be processed; decoding the speech signal according to a pre-generated search space to obtain a speech recognition result, where the search space includes the path of an inverse model, the inverse model includes a first inverse model, and the first inverse model is trained on the segmentation result of the wake-up word; when the first preset number of characters of the speech recognition result are obtained, judging whether the preset number of characters contain at least some of the characters of the wake-up word; and if not, directly determining not to wake up and ending the decoding of the speech signal.
  • portions of the application can be implemented in hardware, software, firmware, or a combination thereof.
  • Multiple steps or methods may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system.
  • For example, if implemented in hardware, as in another embodiment, they may be implemented by any one of the following techniques known in the art, or a combination thereof: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gate circuits, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
  • each functional unit in each embodiment of the present application may be integrated into one processing module, or each unit may exist physically separately, or two or more units may be integrated into one module.
  • the above integrated modules can be implemented in the form of hardware or in the form of software functional modules.
  • the integrated modules, if implemented in the form of software functional modules and sold or used as stand-alone products, may also be stored in a computer readable storage medium.
  • the above mentioned storage medium may be a read only memory, a magnetic disk or an optical disk or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Telephone Function (AREA)
  • User Interface Of Digital Computer (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

A voice wake-up method and apparatus. The voice wake-up method includes: acquiring a speech signal to be processed (S11); decoding the speech signal according to a pre-generated search space to obtain a speech recognition result, where the search space includes the path of an inverse model, the inverse model includes a first inverse model, and the first inverse model is trained on the segmentation result of the wake-up word (S12); when the first preset number of characters of the speech recognition result are obtained, judging whether the preset number of characters contain at least some of the characters of the wake-up word (S13); and if not, directly determining not to wake up and ending the decoding of the speech signal (S14). The method can reduce the false wake-up rate and reduce power consumption.

Description

VOICE WAKE-UP METHOD AND APPARATUS
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to Chinese Patent Application No. 201610357702.4, entitled "Voice Wake-up Method and Apparatus", filed by Baidu Online Network Technology (Beijing) Co., Ltd. (百度在线网络技术(北京)有限公司) on May 26, 2016.
TECHNICAL FIELD
The present application relates to the field of speech recognition technology, and in particular to a voice wake-up method and apparatus.
BACKGROUND
Voice wake-up technology is a feature that acts as a switch-like entry point. By waking a device up with speech, a user can initiate human-computer interaction; that is, the machine recognizes the user's subsequent voice commands only after it has been woken up by the wake-up word spoken by the user.
The related art provides some voice wake-up techniques, but they all have certain problems, such as a high false wake-up rate, poor noise resistance, the need for a constant network connection, high power consumption, support for only a single wake-up word, and low wake-up sensitivity.
SUMMARY
The present application aims to solve at least one of the technical problems in the related art at least to some extent.
To this end, one object of the present application is to propose a voice wake-up method.
Another object of the present application is to propose a voice wake-up apparatus.
To achieve the above objects, the voice wake-up method proposed by embodiments of the first aspect of the present application includes: acquiring a speech signal to be processed; decoding the speech signal according to a pre-generated search space to obtain a speech recognition result, where the search space includes the path of an inverse model, the inverse model includes a first inverse model, and the first inverse model is trained on the segmentation result of the wake-up word; when the first preset number of characters of the speech recognition result are obtained, judging whether the preset number of characters contain at least some of the characters of the wake-up word; and if not, directly determining not to wake up and ending the decoding of the speech signal.
To achieve the above objects, the voice wake-up apparatus proposed by embodiments of the second aspect of the present application includes: an acquiring module configured to acquire a speech signal to be processed; a decoding module configured to decode the speech signal according to a pre-generated search space to obtain a speech recognition result, where the search space includes the path of an inverse model, the inverse model includes a first inverse model, and the first inverse model is trained on the segmentation result of the wake-up word; a determining module configured to, when the first preset number of characters of the speech recognition result are obtained, determine whether the preset number of characters contain at least some of the characters of the wake-up word; and a first processing module configured to, if not, directly determine not to wake up and end the decoding of the speech signal.
Embodiments of the present application have, at least to some extent, at least one of the following technical effects:
By using a first inverse model trained on the segmentation result of the wake-up word, the false wake-up rate can be reduced.
By directly determining not to wake up and ending speech decoding when the first preset number of characters of the speech recognition result do not contain at least part of the wake-up word, power consumption can be reduced.
By performing audio processing on the speech signal, noise resistance can be improved.
By training an inverse model on the clustering result of a corpus, the scale of that inverse model can be reduced, so that it can be deployed locally on the terminal, solving the problem of requiring a constant network connection.
By setting multiple wake-up words, wake-up by any one of them can be achieved.
By weighting the path of the wake-up word, wake-up sensitivity can be improved.
By directly ending the search of an abnormal path during decoding, power consumption can be reduced.
By including in the search space a path in which the inverse model and the wake-up word are connected in series, wake-up can still succeed when the user embeds the wake-up word in a longer sentence, improving wake-up accuracy.
An embodiment of the present application further provides a terminal, including: a processor; and a memory for storing instructions executable by the processor. The processor is configured to: acquire a speech signal to be processed; decode the speech signal according to a pre-generated search space to obtain a speech recognition result, where the search space includes the path of an inverse model, the inverse model includes a first inverse model, and the first inverse model is trained on the segmentation result of the wake-up word; when the first preset number of characters of the speech recognition result are obtained, determine whether the preset number of characters contain at least some of the characters of the wake-up word; and if not, directly determine not to wake up and end the decoding of the speech signal.
An embodiment of the present application further provides a non-transitory computer-readable storage medium. When instructions in the storage medium are executed by a processor of a terminal, the terminal is enabled to perform a method including: acquiring a speech signal to be processed; decoding the speech signal according to a pre-generated search space to obtain a speech recognition result, where the search space includes the path of an inverse model, the inverse model includes a first inverse model, and the first inverse model is trained on the segmentation result of the wake-up word; when the first preset number of characters of the speech recognition result are obtained, judging whether the preset number of characters contain at least some of the characters of the wake-up word; and if not, directly determining not to wake up and ending the decoding of the speech signal.
An embodiment of the present application further provides a computer program product. When instructions in the computer program product are executed by a processor, a method is performed that includes: acquiring a speech signal to be processed; decoding the speech signal according to a pre-generated search space to obtain a speech recognition result, where the search space includes the path of an inverse model, the inverse model includes a first inverse model, and the first inverse model is trained on the segmentation result of the wake-up word; when the first preset number of characters of the speech recognition result are obtained, judging whether the preset number of characters contain at least some of the characters of the wake-up word; and if not, directly determining not to wake up and ending the decoding of the speech signal.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the following description or be learned through practice of the present invention.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and/or additional aspects and advantages of the present application will become apparent and readily understood from the following description of embodiments taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic flowchart of a voice wake-up method according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of a voice wake-up method according to another embodiment of the present application;
FIG. 3 is a schematic diagram of a search space in an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a voice wake-up apparatus according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a voice wake-up apparatus according to an embodiment of the present application.
DETAILED DESCRIPTION
Embodiments of the present application are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which identical or similar reference numerals throughout denote identical or similar modules or modules having identical or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the present application, and are not to be construed as limiting it. On the contrary, the embodiments of the present application include all changes, modifications and equivalents falling within the spirit and scope of the appended claims.
As noted above, the voice wake-up techniques in the related art have certain problems. The present application addresses these technical problems based on the following ideas.
(1) Construct a new inverse model, generated by training on the segmentation result of the wake-up word. This prevents the device from being woken up by only part of the wake-up word, solving the problem of a high false wake-up rate.
(2) Perform audio processing, such as noise reduction and signal enhancement, on the speech signal input by the user, to solve the problem of poor noise resistance.
(3) Construct a new inverse model, generated by training on the clustering result of a corpus. This reduces the scale of the inverse model so that it can be deployed locally on the terminal, solving the problem of requiring a constant network connection.
(4) Directly determine not to wake up based on the first few characters of the speech recognition result, without waiting for the whole utterance to be decoded, which reduces power consumption. In addition, during decoding, if an abnormal path is found, the search of that path can be ended directly, which also reduces power consumption.
(5) Do not limit the number of wake-up words; there may be more than one.
(6) Weight the path of the wake-up word so that the wake-up word is more likely to follow that path, improving wake-up sensitivity.
It should be noted that, although the main idea for each technical problem is explained above, the concrete technical solutions are not limited to these main ideas and may be combined with other features; such combinations of different technical features still fall within the scope of protection of the present application.
It should be noted that, although several technical problems to be solved are listed above, the present application is not limited to solving only these problems; other technical problems that can be solved by applying the technical solutions given in the present application also fall within its scope of protection.
It should be noted that each embodiment of the present application is not required to solve all the technical problems perfectly, but solves at least one technical problem at least to some extent.
It should be noted that, although the main ideas of the present application are given above and some particular points are explained in the following embodiments, the innovations of the present application are not limited to the content of these main ideas and particular points, and it is not excluded that some content not specially explained herein may still contain innovations of the present application.
It will be understood that, although some explanations are given above, other possible solutions are not excluded; therefore, technical solutions that are the same as, similar to, or equivalent to the embodiments given later in the present application still fall within its scope of protection.
The technical solutions of the present application are described below with reference to specific embodiments.
The voice wake-up technology of the present application can be applied in particular to offline scenarios, that is, locally on a terminal. Of course, it will be understood that the voice wake-up technology of the present application can also be applied on a server side to implement online voice wake-up.
The terminal involved in the present application may be any terminal to which voice wake-up technology can be applied, such as a mobile terminal, an in-vehicle terminal, an airborne terminal, or a desktop computer.
FIG. 1 is a schematic flowchart of a voice wake-up method according to an embodiment of the present application.
This embodiment can, at least to some extent, solve the problems of a high false wake-up rate and high power consumption.
As shown in FIG. 1, the flow of this embodiment includes:
S11: acquire a speech signal to be processed.
The initial speech signal is input by the user.
In this embodiment, to improve noise resistance, audio processing may be performed on the speech signal input by the user to obtain the speech signal to be processed. Details are given in the description below.
S12: decode the speech signal according to a pre-generated search space to obtain a speech recognition result, where the search space includes the path of an inverse model, the inverse model includes a first inverse model, and the first inverse model is trained on the segmentation result of the wake-up word.
After the speech signal to be processed is obtained, feature extraction may be performed on it to obtain acoustic features; the acoustic features are then used to search for the optimal path in the search space, and the text corresponding to the optimal path is determined as the speech recognition result.
The search space includes multiple paths, which may specifically include the path of the wake-up word and the path of the inverse model, where the inverse model is used to guide non-wake-up words onto the inverse-model path during speech decoding.
In this embodiment, one inverse model is called the first inverse model, which is generated by training on the segmentation result of the wake-up word. For example, if the wake-up word is "百度一下" (the wake-up word is configurable), it may first be segmented. The segmentation rule may be: take the first character, and, when the wake-up word has more than three characters, additionally split it into two-character chunks. The segmentation result is then "百", "百度" and "一下", and "百度" and "一下" can be used as training data for the inverse model to obtain the first inverse model.
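As an illustration of the segmentation rule just described, the short Python sketch below reconstructs it from the "百度一下" example. The function name and the exact chunking logic are assumptions made for illustration; the patent does not prescribe an implementation.

```python
def segment_wake_word(wake_word: str):
    """Split a wake word into partial-word units that could serve as
    (hypothetical) training data for the first inverse model.

    Rule sketched from the description: always take the first character;
    when the wake word is longer than three characters, additionally
    split it into consecutive two-character chunks.
    """
    segments = [wake_word[0]]                                   # first character, e.g. "百"
    if len(wake_word) > 3:
        segments += [wake_word[i:i + 2] for i in range(0, len(wake_word), 2)]
    return segments

# Example: the wake word "百度一下" yields ['百', '百度', '一下'];
# the partial units such as "百度" and "一下" can then be used as
# training data for the first inverse model.
print(segment_wake_word("百度一下"))
```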
With the first inverse model, "百" followed by something other than "度一下", or "百度" followed by something other than "一下", will go onto the inverse-model path, avoiding a false wake-up.
S13: when the first preset number of characters of the speech recognition result are obtained, judge whether the preset number of characters contain at least some of the characters of the wake-up word.
For example, if the preset number of characters is three, then when the first three characters corresponding to the speech signal are obtained, it can be judged whether these three characters contain at least part of the wake-up word. For example, if the wake-up word is "百度一下", it is judged whether the first three characters are "百度一", whether the last two of the first three characters are "百度", or whether the last of the first three characters is "百".
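The check described above can be pictured as asking whether some suffix of the first few decoded characters matches a prefix of the wake-up word. The sketch below is a hypothetical illustration of that test, not the patent's implementation; the function name and return convention are assumptions.

```python
def may_contain_wake_word(first_chars: str, wake_word: str) -> bool:
    """Early-rejection test: keep decoding only if some suffix of the
    first few recognized characters matches a prefix of the wake word;
    otherwise decoding can be stopped immediately."""
    for k in range(1, min(len(first_chars), len(wake_word)) + 1):
        if first_chars.endswith(wake_word[:k]):
            return True
    return False

# With wake word "百度一下" and the first three decoded characters:
print(may_contain_wake_word("百度一", "百度一下"))  # True  -> keep decoding
print(may_contain_wake_word("打开百", "百度一下"))  # True  -> "百" may start the wake word
print(may_contain_wake_word("今天好", "百度一下"))  # False -> decide "no wake-up", stop decoding
```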
S14: if not, directly determine not to wake up and end the decoding of the speech signal.
For example, if the first three characters are not "百度一", the last two of them are not "百度", and the last of them is not "百", it is directly determined not to wake up, and no wake-up operation is performed.
In addition, when it is determined from the first three characters not to wake up, the decoding of the speech signal is also ended directly. Since the speech signal is a stretch of signal, there may be one or more further characters beyond the first three. In this embodiment, once the first three characters have been recognized and it has been determined from them not to wake up, there is no need to continue recognizing the subsequent characters; the recognition of this speech signal is ended directly, so power consumption can be reduced.
In this embodiment, by using a first inverse model trained on the segmentation result of the wake-up word, the false wake-up rate can be reduced. By directly determining not to wake up and ending speech decoding when the first preset number of characters of the speech recognition result do not contain at least part of the wake-up word, power consumption can be reduced.
FIG. 2 is a schematic flowchart of a voice wake-up method according to another embodiment of the present application.
This embodiment can, at least to some extent, solve the problems of a high false wake-up rate, high power consumption, the need for a constant network connection, poor noise resistance, support for only a single wake-up word, and poor wake-up sensitivity.
It will be understood that this embodiment gives a technical solution addressing a relatively comprehensive set of problems; however, the present application is not limited to this solution. Technical features that solve different technical problems may form separate technical solutions, or different technical features may be combined in any other manner to obtain new technical solutions.
As shown in FIG. 2, the flow of this embodiment includes:
S201: construct the graph to generate the search space.
Referring to FIG. 3, the search space may include multiple paths, including the path 31 of the wake-up word and the path 32 of the inverse model.
In this embodiment, to handle the case where the user embeds the wake-up word in a longer sentence, the search space further includes a path 33 in which the wake-up word and the inverse model are connected in series.
Further, on path 33 the silence (SIL) state may be entered directly after the wake-up word, or the SIL state may be entered after passing through the inverse model.
In this embodiment, to improve wake-up sensitivity, the path of the wake-up word may additionally be weighted; that is, the weight of the wake-up word path is increased relative to its original value, so that the wake-up word is more likely to enter the wake-up word path.
One or more wake-up words may be set.
The above inverse model may include a first inverse model and a second inverse model; specifically, it may be formed by connecting the first and second inverse models in parallel, possibly with weighting.
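Purely as an illustration of how such a search space might be laid out, the toy Python structure below lists the three kinds of paths together with weights. The labels, weight values and boost factor are invented for the example; a real decoder would compile these paths into a weighted graph (for instance a WFST) rather than a dictionary.

```python
WAKE_WORD_BOOST = 2.0   # hypothetical extra weight on the wake-word path

# A toy "search space": each entry is one kind of decoding path with a weight.
search_space = {
    # path 31: the wake word itself, with its weight boosted so that genuine
    # wake-word audio is more likely to follow this path
    "wake_word":        {"units": ["百度一下"], "weight": 1.0 * WAKE_WORD_BOOST},
    # path 32: the inverse model, here the first and second inverse models
    # combined in parallel
    "inverse_parallel": {"units": ["<inv1>", "<inv2>"], "weight": 1.0},
    # path 33: inverse model and wake word in series, followed by silence,
    # so a wake word embedded in a longer utterance can still be matched
    "serial":           {"units": ["<inv>", "百度一下", "<sil>"], "weight": 1.0},
}
```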
As described above, the first inverse model is generated by training on the segmentation result of the wake-up word.
In this embodiment, the second inverse model is not trained directly on the corpus but on the clustering result of the corpus, which reduces the scale of the second inverse model and makes it better suited to running locally on the terminal.
Specifically, some commonly used speech corpora may be used to cluster the syllables of pronunciations, for example into 26 clusters corresponding to 26 units; the second inverse model can then be trained from these 26 units, and this second inverse model is a streamlined model.
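A minimal sketch of this clustering step is given below, assuming k-means over syllable-level acoustic features. The patent does not specify the clustering algorithm or the feature type, so the feature dimensionality, the use of scikit-learn's KMeans, and the random placeholder data are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder syllable-level acoustic features (e.g. 39-dim MFCC-like vectors);
# real corpus data would be used instead of random numbers.
rng = np.random.default_rng(0)
syllable_features = rng.normal(size=(5000, 39))

# Cluster the syllables into 26 clusters; each cluster acts as one "unit"
# from which the compact second inverse model is trained.
kmeans = KMeans(n_clusters=26, n_init=10, random_state=0).fit(syllable_features)
cluster_of_syllable = kmeans.labels_            # cluster id per syllable sample

# In this sketch, the 26 cluster centroids stand in for the 26 units of the
# streamlined second inverse model kept on the terminal.
second_inverse_model_units = kmeans.cluster_centers_
```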
The generation of the search space can be completed through the above flow.
It will be understood that the search space may be generated in advance, before voice wake-up takes place.
When voice wake-up is required, the following steps are further performed:
S202: receive a speech signal input by a user.
For example, the user says something to the terminal.
It will be understood that some initialization may be performed before the speech signal is received; for example, the wake-up word may be set, the search space generated, and the audio processing module initialized.
S203: perform audio processing on the speech signal.
The audio processing in this embodiment may specifically include noise reduction and speech enhancement.
Noise reduction can in turn be divided into noise reduction for low-frequency noise and noise reduction for non-low-frequency noise.
Specifically, noise from air conditioners, vehicle engines and the like is low-frequency noise, and high-pass filtering can be used to remove it.
Noise such as background music or other voices is non-low-frequency noise, and noise suppression (NS) techniques can be used to remove it.
Affected by the different gains of hardware microphones, the volume of some speech signals may be at a rather low level; therefore, automatic gain control (AGC) can be used for speech enhancement, boosting the energy of audio signals whose volume is too low to a level at which recognition is possible.
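The front end described in S203 could be approximated as in the sketch below. The function names, cutoff frequency, filter order and target level are assumptions, the input is assumed to be a float NumPy array, and the non-low-frequency noise suppressor is only indicated by a comment because the patent does not fix an NS algorithm.

```python
import numpy as np
from scipy.signal import butter, lfilter

def high_pass(signal, sample_rate=16000, cutoff_hz=100.0, order=4):
    """High-pass filter to suppress low-frequency noise such as
    air-conditioner or engine hum (cutoff chosen as an assumption)."""
    b, a = butter(order, cutoff_hz / (sample_rate / 2.0), btype="highpass")
    return lfilter(b, a, signal)

def simple_agc(signal, target_rms=0.1, max_gain=10.0):
    """A very simple automatic-gain-control stand-in: scale the signal
    toward a target RMS level so that quiet speech becomes recognizable."""
    rms = np.sqrt(np.mean(signal ** 2)) + 1e-12
    gain = min(target_rms / rms, max_gain)
    return signal * gain

def audio_front_end(raw_signal, sample_rate=16000):
    # A noise suppressor (NS) for non-low-frequency noise would sit between
    # these two steps in a fuller implementation.
    return simple_agc(high_pass(raw_signal, sample_rate))
```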
S204: perform voice activity detection (VAD) on the audio-processed speech signal.
The speech signal to be processed is obtained through VAD.
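For illustration only, a minimal energy-based VAD is sketched below; the patent does not fix a particular VAD algorithm, so the frame length, threshold and the whole energy-based criterion are assumptions. The input is assumed to be a float NumPy array.

```python
import numpy as np

def energy_vad(signal, sample_rate=16000, frame_ms=20, threshold=1e-4):
    """Keep only frames whose mean energy exceeds a threshold and return
    them concatenated as the speech signal to be decoded."""
    frame_len = int(sample_rate * frame_ms / 1000)
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    voiced = [f for f in frames if np.mean(f ** 2) > threshold]
    return np.concatenate(voiced) if voiced else np.array([])
```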
S205: decode the speech signal to be processed according to the search space to obtain a speech recognition result.
During decoding, acoustic features may first be extracted from the speech signal, and these features are then searched against the search space, with the optimal path taken as the speech recognition result. Specifically, the search algorithm may be a Viterbi search.
In this embodiment, if a path is found to be abnormal during the search, the search of that path can be ended directly, which narrows the search range, improves search efficiency, and reduces power consumption. For judging abnormal paths, taking a hidden Markov model (HMM) acoustic model as an example, if the difference between the acoustic-model scores of adjacent states obtained while searching along a path is greater than a preset value, that path can be determined to be abnormal.
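The abnormal-path test can be sketched as a simple check over the acoustic scores collected along a partial path. The threshold value and the way scores are gathered are assumptions for illustration; the patent only states that the adjacent-state score difference is compared against a preset value.

```python
def is_abnormal_path(state_scores, max_jump=50.0):
    """Return True if the acoustic-model scores of adjacent HMM states
    along a search path differ by more than a preset value (max_jump is
    an assumed threshold); such a path is treated as abnormal and its
    search is terminated to save power."""
    return any(abs(a - b) > max_jump
               for a, b in zip(state_scores, state_scores[1:]))

# During Viterbi search, a decoder could call this on each partial path and
# stop expanding any path for which it returns True.
```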
Further, if a stretch of speech detected by the VAD contains more than one wake-up word, the VAD may be reset immediately after a wake-up word is detected and the wake-up word detection process restarted, so as to avoid the situation where only one wake-up word can be hit within a single VAD stretch.
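The reset behaviour can be illustrated with the following control-flow sketch, in which `decode_fn` is a hypothetical decoder returning the recognized text of one VAD segment; only the idea of resetting and rescanning after each hit is shown, not any real decoder interface.

```python
def count_wake_word_hits(vad_segments, decode_fn, wake_word):
    """After each wake-word hit, reset and keep scanning the remainder of
    the same VAD segment, so that several wake words inside one segment
    can all be detected."""
    hits = 0
    for segment in vad_segments:
        text = decode_fn(segment)          # hypothetical decoder call
        while wake_word in text:
            hits += 1
            text = text.split(wake_word, 1)[1]   # reset: scan what follows the hit
    return hits
```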
S206: when the first three characters of the speech recognition result are obtained, judge whether they contain at least part of the wake-up word; if so, perform S207, otherwise perform S209.
S207: continue speech recognition and judge whether the entire speech recognition result contains the wake-up word; if so, perform S208, otherwise perform S209.
S208: perform the wake-up operation.
S209: do not wake up.
In the branch connected with S207, when it is determined not to wake up, the decoding of the speech signal is also ended directly.
S210: release resources.
Resources may be released when the device does not wake up or after it has woken up.
The main function of resource release is to free the memory occupied by the resources loaded during initialization, complete the reset of the wake-up module, and clear historical cached data.
In this embodiment, by using a first inverse model trained on the segmentation result of the wake-up word, the false wake-up rate can be reduced. By directly determining not to wake up and ending speech decoding when the first preset number of characters of the speech recognition result do not contain at least part of the wake-up word, power consumption can be reduced. By performing audio processing on the speech signal, noise resistance can be improved. By training an inverse model on the clustering result of a corpus, the scale of that inverse model can be reduced, so that it can be deployed locally on the terminal, solving the problem of requiring a constant network connection. By setting multiple wake-up words, wake-up by any one of them can be achieved. By weighting the path of the wake-up word, wake-up sensitivity can be improved. By directly ending the search of an abnormal path during decoding, power consumption can be reduced. By including in the search space a path in which the inverse model and the wake-up word are connected in series, wake-up can still succeed when the user embeds the wake-up word in a longer sentence, improving wake-up accuracy.
FIG. 4 is a schematic structural diagram of a voice wake-up apparatus according to an embodiment of the present application.
As shown in FIG. 4, the apparatus 40 includes an acquiring module 41, a decoding module 42, a determining module 43 and a first processing module 44.
The acquiring module 41 is configured to acquire a speech signal to be processed.
The decoding module 42 is configured to decode the speech signal according to a pre-generated search space to obtain a speech recognition result, where the search space includes the path of an inverse model, the inverse model includes a first inverse model, and the first inverse model is trained on the segmentation result of the wake-up word.
The determining module 43 is configured to, when the first preset number of characters of the speech recognition result are obtained, determine whether the preset number of characters contain at least some of the characters of the wake-up word.
The first processing module 44 is configured to, if not, directly determine not to wake up and end the decoding of the speech signal.
In some embodiments, referring to FIG. 5, the acquiring module 41 includes:
a receiving sub-module 411 configured to receive a speech signal input by a user;
an audio processing sub-module 412 configured to perform audio processing on the speech signal; and
an endpoint detection sub-module 413 configured to perform VAD on the audio-processed speech signal to obtain the speech signal to be processed.
In some embodiments, the audio processing sub-module 412 is specifically configured to:
perform high-pass filtering on the speech signal to remove low-frequency noise;
perform noise suppression on the speech signal to remove non-low-frequency noise; and
perform AGC on the speech signal to enhance the strength of the speech signal.
In some embodiments, referring to FIG. 5, the apparatus 40 further includes:
a second processing module 45 configured to, if the preset number of characters contain at least some of the characters of the wake-up word, continue speech decoding to obtain the entire speech recognition result corresponding to the speech signal, and, if the entire speech recognition result contains the wake-up word, perform the wake-up operation.
In some embodiments, the inverse model further includes a second inverse model, and the second inverse model is trained on the clustering result of a corpus.
In some embodiments, the search space further includes the path of the wake-up word, and the weight of that path has been increased by weighting.
In some embodiments, the search space further includes a path in which the inverse model and the wake-up word are connected in series.
In some embodiments, the decoding module 42 is specifically configured to, during decoding, directly end the search of a path if that path is found to be abnormal.
In some embodiments, there are multiple wake-up words.
It will be understood that the apparatus of this embodiment corresponds to the above method embodiments; for details, reference may be made to the relevant description of the method embodiments, which is not repeated here.
In this embodiment, by using a first inverse model trained on the segmentation result of the wake-up word, the false wake-up rate can be reduced. By directly determining not to wake up and ending speech decoding when the first preset number of characters of the speech recognition result do not contain at least part of the wake-up word, power consumption can be reduced. By performing audio processing on the speech signal, noise resistance can be improved. By training an inverse model on the clustering result of a corpus, the scale of that inverse model can be reduced, so that it can be deployed locally on the terminal, solving the problem of requiring a constant network connection. By setting multiple wake-up words, wake-up by any one of them can be achieved. By weighting the path of the wake-up word, wake-up sensitivity can be improved. By directly ending the search of an abnormal path during decoding, power consumption can be reduced. By including in the search space a path in which the inverse model and the wake-up word are connected in series, wake-up can still succeed when the user embeds the wake-up word in a longer sentence, improving wake-up accuracy.
It will be understood that identical or similar parts of the above embodiments may refer to one another, and content not described in detail in some embodiments may refer to the identical or similar content in other embodiments.
An embodiment of the present application further provides a terminal, including: a processor; and a memory for storing instructions executable by the processor. The processor is configured to: acquire a speech signal to be processed; decode the speech signal according to a pre-generated search space to obtain a speech recognition result, where the search space includes the path of an inverse model, the inverse model includes a first inverse model, and the first inverse model is trained on the segmentation result of the wake-up word; when the first preset number of characters of the speech recognition result are obtained, determine whether the preset number of characters contain at least some of the characters of the wake-up word; and if not, directly determine not to wake up and end the decoding of the speech signal.
An embodiment of the present application further provides a non-transitory computer-readable storage medium. When instructions in the storage medium are executed by a processor of a terminal, the terminal is enabled to perform a method including: acquiring a speech signal to be processed; decoding the speech signal according to a pre-generated search space to obtain a speech recognition result, where the search space includes the path of an inverse model, the inverse model includes a first inverse model, and the first inverse model is trained on the segmentation result of the wake-up word; when the first preset number of characters of the speech recognition result are obtained, judging whether the preset number of characters contain at least some of the characters of the wake-up word; and if not, directly determining not to wake up and ending the decoding of the speech signal.
An embodiment of the present application further provides a computer program product. When instructions in the computer program product are executed by a processor, a method is performed that includes: acquiring a speech signal to be processed; decoding the speech signal according to a pre-generated search space to obtain a speech recognition result, where the search space includes the path of an inverse model, the inverse model includes a first inverse model, and the first inverse model is trained on the segmentation result of the wake-up word; when the first preset number of characters of the speech recognition result are obtained, judging whether the preset number of characters contain at least some of the characters of the wake-up word; and if not, directly determining not to wake up and ending the decoding of the speech signal.
It should be noted that, in the description of the present application, the terms "first", "second" and the like are used for descriptive purposes only and are not to be understood as indicating or implying relative importance. Furthermore, in the description of the present application, unless otherwise stated, "multiple" means at least two.
Any process or method description in the flowcharts or otherwise described herein may be understood as representing a module, segment or portion of code including one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of preferred embodiments of the present application includes additional implementations in which functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present application belong.
It should be understood that the parts of the present application may be implemented in hardware, software, firmware or a combination thereof. In the above embodiments, multiple steps or methods may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any one of the following techniques known in the art, or a combination thereof: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gate circuits, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Those of ordinary skill in the art will understand that all or part of the steps carried by the methods of the above embodiments may be completed by instructing relevant hardware through a program, which may be stored in a computer-readable storage medium and which, when executed, includes one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist physically on its own, or two or more units may be integrated into one module. The above integrated modules may be implemented in the form of hardware or in the form of software functional modules. If the integrated modules are implemented in the form of software functional modules and sold or used as independent products, they may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic references to these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the present application have been shown and described above, it will be understood that the above embodiments are exemplary and are not to be construed as limiting the present application; those of ordinary skill in the art may make changes, modifications, substitutions and variations to the above embodiments within the scope of the present application.

Claims (16)

  1. A voice wake-up method, characterized by comprising:
    acquiring a speech signal to be processed;
    decoding the speech signal according to a pre-generated search space to obtain a speech recognition result, wherein the search space includes the path of an inverse model, the inverse model includes a first inverse model, and the first inverse model is trained on the segmentation result of a wake-up word;
    when the first preset number of characters of the speech recognition result are obtained, judging whether the preset number of characters contain at least some of the characters of the wake-up word; and
    if not, directly determining not to wake up and ending the decoding of the speech signal.
  2. The method according to claim 1, wherein acquiring the speech signal to be processed comprises:
    receiving a speech signal input by a user;
    performing audio processing on the speech signal; and
    performing VAD on the audio-processed speech signal to obtain the speech signal to be processed.
  3. The method according to claim 2, wherein performing audio processing on the speech signal comprises:
    performing high-pass filtering on the speech signal to remove low-frequency noise;
    performing noise suppression on the speech signal to remove non-low-frequency noise; and
    performing AGC on the speech signal to enhance the strength of the speech signal.
  4. The method according to any one of claims 1-3, further comprising:
    if the preset number of characters contain at least some of the characters of the wake-up word, continuing speech decoding to obtain the entire speech recognition result corresponding to the speech signal; and
    if the entire speech recognition result contains the wake-up word, performing a wake-up operation.
  5. The method according to any one of claims 1-4, wherein the inverse model further includes a second inverse model, and the second inverse model is trained on the clustering result of a corpus.
  6. The method according to any one of claims 1-5, wherein the search space further includes the path of the wake-up word, and the weight of the path of the wake-up word has been increased by weighting.
  7. The method according to any one of claims 1-6, wherein the search space further includes a path in which the inverse model and the wake-up word are connected in series.
  8. The method according to any one of claims 1-7, wherein, during decoding, if an abnormal path is found, the search of that path is ended directly.
  9. The method according to any one of claims 1-8, wherein there are multiple wake-up words.
  10. A voice wake-up apparatus, characterized by comprising:
    an acquiring module configured to acquire a speech signal to be processed;
    a decoding module configured to decode the speech signal according to a pre-generated search space to obtain a speech recognition result, wherein the search space includes the path of an inverse model, the inverse model includes a first inverse model, and the first inverse model is trained on the segmentation result of a wake-up word;
    a determining module configured to, when the first preset number of characters of the speech recognition result are obtained, determine whether the preset number of characters contain at least some of the characters of the wake-up word; and
    a first processing module configured to, if not, directly determine not to wake up and end the decoding of the speech signal.
  11. The apparatus according to claim 10, wherein the acquiring module comprises:
    a receiving sub-module configured to receive a speech signal input by a user;
    an audio processing sub-module configured to perform audio processing on the speech signal; and
    an endpoint detection sub-module configured to perform VAD on the audio-processed speech signal to obtain the speech signal to be processed.
  12. The apparatus according to any one of claims 10-11, further comprising:
    a second processing module configured to, if the preset number of characters contain at least some of the characters of the wake-up word, continue speech decoding to obtain the entire speech recognition result corresponding to the speech signal, and, if the entire speech recognition result contains the wake-up word, perform a wake-up operation.
  13. The apparatus according to any one of claims 10-12, wherein the decoding module is specifically configured to, during decoding, directly end the search of a path if that path is found to be abnormal.
  14. A terminal, characterized by comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to:
    acquire a speech signal to be processed;
    decode the speech signal according to a pre-generated search space to obtain a speech recognition result, wherein the search space includes the path of an inverse model, the inverse model includes a first inverse model, and the first inverse model is trained on the segmentation result of a wake-up word;
    when the first preset number of characters of the speech recognition result are obtained, determine whether the preset number of characters contain at least some of the characters of the wake-up word; and
    if not, directly determine not to wake up and end the decoding of the speech signal.
  15. A non-transitory computer-readable storage medium, characterized in that, when instructions in the storage medium are executed by a processor of a terminal, the terminal is enabled to perform a method comprising:
    acquiring a speech signal to be processed;
    decoding the speech signal according to a pre-generated search space to obtain a speech recognition result, wherein the search space includes the path of an inverse model, the inverse model includes a first inverse model, and the first inverse model is trained on the segmentation result of a wake-up word;
    when the first preset number of characters of the speech recognition result are obtained, judging whether the preset number of characters contain at least some of the characters of the wake-up word; and
    if not, directly determining not to wake up and ending the decoding of the speech signal.
  16. A computer program product, characterized in that, when instructions in the computer program product are executed by a processor, a method is performed, the method comprising:
    acquiring a speech signal to be processed;
    decoding the speech signal according to a pre-generated search space to obtain a speech recognition result, wherein the search space includes the path of an inverse model, the inverse model includes a first inverse model, and the first inverse model is trained on the segmentation result of a wake-up word;
    when the first preset number of characters of the speech recognition result are obtained, judging whether the preset number of characters contain at least some of the characters of the wake-up word; and
    if not, directly determining not to wake up and ending the decoding of the speech signal.
PCT/CN2016/111367 2016-05-26 2016-12-21 语音唤醒方法和装置 WO2017202016A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/099,943 US10867602B2 (en) 2016-05-26 2016-12-21 Method and apparatus for waking up via speech

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610357702.4 2016-05-26
CN201610357702.4A CN105869637B (zh) 2016-05-26 2016-05-26 语音唤醒方法和装置

Publications (1)

Publication Number Publication Date
WO2017202016A1 true WO2017202016A1 (zh) 2017-11-30

Family

ID=56641927

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/111367 WO2017202016A1 (zh) 2016-05-26 2016-12-21 语音唤醒方法和装置

Country Status (3)

Country Link
US (1) US10867602B2 (zh)
CN (1) CN105869637B (zh)
WO (1) WO2017202016A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111081241A (zh) * 2019-11-20 2020-04-28 Oppo广东移动通信有限公司 设备误唤醒的数据检测方法、装置、移动终端和存储介质

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105869637B (zh) 2016-05-26 2019-10-15 百度在线网络技术(北京)有限公司 语音唤醒方法和装置
CN106328137A (zh) * 2016-08-19 2017-01-11 镇江惠通电子有限公司 语音控制方法、装置及系统
CN107767863B (zh) * 2016-08-22 2021-05-04 科大讯飞股份有限公司 语音唤醒方法、系统及智能终端
CN106653022B (zh) * 2016-12-29 2020-06-23 百度在线网络技术(北京)有限公司 基于人工智能的语音唤醒方法和装置
CN107223280B (zh) * 2017-03-03 2021-01-08 深圳前海达闼云端智能科技有限公司 机器人唤醒方法、装置和机器人
CN106971719A (zh) * 2017-05-16 2017-07-21 上海智觅智能科技有限公司 一种离线可切换唤醒词的非特定音语音识别唤醒方法
KR102371313B1 (ko) * 2017-05-29 2022-03-08 삼성전자주식회사 사용자 발화를 처리하는 전자 장치 및 그 전자 장치의 제어 방법
CN109147776A (zh) * 2017-06-19 2019-01-04 丽宝大数据股份有限公司 具有声控功能的显示装置及声控时机指示方法
CN107564517A (zh) * 2017-07-05 2018-01-09 百度在线网络技术(北京)有限公司 语音唤醒方法、设备及系统、云端服务器与可读介质
CN107895573B (zh) * 2017-11-15 2021-08-24 百度在线网络技术(北京)有限公司 用于识别信息的方法及装置
CN107919124B (zh) * 2017-12-22 2021-07-13 北京小米移动软件有限公司 设备唤醒方法及装置
US10332543B1 (en) * 2018-03-12 2019-06-25 Cypress Semiconductor Corporation Systems and methods for capturing noise for pattern recognition processing
CN108665900B (zh) 2018-04-23 2020-03-03 百度在线网络技术(北京)有限公司 云端唤醒方法及系统、终端以及计算机可读存储介质
CN108899014B (zh) * 2018-05-31 2021-06-08 中国联合网络通信集团有限公司 语音交互设备唤醒词生成方法及装置
CN108899028A (zh) * 2018-06-08 2018-11-27 广州视源电子科技股份有限公司 语音唤醒方法、搜索方法、装置和终端
CN109032554B (zh) * 2018-06-29 2021-11-16 联想(北京)有限公司 一种音频处理方法和电子设备
CN111063356B (zh) * 2018-10-17 2023-05-09 北京京东尚科信息技术有限公司 电子设备响应方法及系统、音箱和计算机可读存储介质
CN111475206B (zh) * 2019-01-04 2023-04-11 优奈柯恩(北京)科技有限公司 用于唤醒可穿戴设备的方法及装置
CN111554271B (zh) * 2019-01-24 2024-12-06 北京搜狗科技发展有限公司 端到端唤醒词检测方法及装置
CN111862943B (zh) * 2019-04-30 2023-07-25 北京地平线机器人技术研发有限公司 语音识别方法和装置、电子设备和存储介质
KR102246936B1 (ko) * 2019-06-20 2021-04-29 엘지전자 주식회사 음성 인식 방법 및 음성 인식 장치
CN110310628B (zh) * 2019-06-27 2022-05-20 百度在线网络技术(北京)有限公司 唤醒模型的优化方法、装置、设备及存储介质
US20220319511A1 (en) * 2019-07-22 2022-10-06 Lg Electronics Inc. Display device and operation method for same
CN110634483B (zh) * 2019-09-03 2021-06-18 北京达佳互联信息技术有限公司 人机交互方法、装置、电子设备及存储介质
US11488581B1 (en) * 2019-12-06 2022-11-01 Amazon Technologies, Inc. System and method of providing recovery for automatic speech recognition errors for named entities
CN112466304B (zh) * 2020-12-03 2023-09-08 北京百度网讯科技有限公司 离线语音交互方法、装置、系统、设备和存储介质
CN113516977B (zh) * 2021-03-15 2024-08-02 每刻深思智能科技(北京)有限责任公司 关键词识别方法及系统
CN114360522B (zh) * 2022-03-09 2022-08-02 深圳市友杰智新科技有限公司 语音唤醒模型的训练方法、语音误唤醒的检测方法及设备

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103021409A (zh) * 2012-11-13 2013-04-03 安徽科大讯飞信息科技股份有限公司 一种语音启动拍照系统
CN104464723A (zh) * 2014-12-16 2015-03-25 科大讯飞股份有限公司 一种语音交互方法及系统
CN105210146A (zh) * 2013-05-07 2015-12-30 高通股份有限公司 用于控制语音激活的方法和设备
CN105489222A (zh) * 2015-12-11 2016-04-13 百度在线网络技术(北京)有限公司 语音识别方法和装置
WO2016064556A1 (en) * 2014-10-22 2016-04-28 Qualcomm Incorporated Sound sample verification for generating sound detection model
CN105869637A (zh) * 2016-05-26 2016-08-17 百度在线网络技术(北京)有限公司 语音唤醒方法和装置

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5956673A (en) * 1995-01-25 1999-09-21 Weaver, Jr.; Lindsay A. Detection and bypass of tandem vocoding using detection codes
CN101452701B (zh) * 2007-12-05 2011-09-07 株式会社东芝 基于反模型的置信度估计方法及装置
JP4808764B2 (ja) * 2008-12-15 2011-11-02 インターナショナル・ビジネス・マシーンズ・コーポレーション 音声認識システムおよび方法
US9275637B1 (en) * 2012-11-06 2016-03-01 Amazon Technologies, Inc. Wake word evaluation
US9361885B2 (en) * 2013-03-12 2016-06-07 Nuance Communications, Inc. Methods and apparatus for detecting a voice command
US10121471B2 (en) * 2015-06-29 2018-11-06 Amazon Technologies, Inc. Language model speech endpointing
US9996316B2 (en) * 2015-09-28 2018-06-12 Amazon Technologies, Inc. Mediation of wakeword response for multiple devices

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103021409A (zh) * 2012-11-13 2013-04-03 安徽科大讯飞信息科技股份有限公司 一种语音启动拍照系统
CN105210146A (zh) * 2013-05-07 2015-12-30 高通股份有限公司 用于控制语音激活的方法和设备
WO2016064556A1 (en) * 2014-10-22 2016-04-28 Qualcomm Incorporated Sound sample verification for generating sound detection model
CN104464723A (zh) * 2014-12-16 2015-03-25 科大讯飞股份有限公司 一种语音交互方法及系统
CN105489222A (zh) * 2015-12-11 2016-04-13 百度在线网络技术(北京)有限公司 语音识别方法和装置
CN105869637A (zh) * 2016-05-26 2016-08-17 百度在线网络技术(北京)有限公司 语音唤醒方法和装置

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111081241A (zh) * 2019-11-20 2020-04-28 Oppo广东移动通信有限公司 设备误唤醒的数据检测方法、装置、移动终端和存储介质
CN111081241B (zh) * 2019-11-20 2023-04-07 Oppo广东移动通信有限公司 设备误唤醒的数据检测方法、装置、移动终端和存储介质

Also Published As

Publication number Publication date
CN105869637B (zh) 2019-10-15
CN105869637A (zh) 2016-08-17
US10867602B2 (en) 2020-12-15
US20190139545A1 (en) 2019-05-09

Similar Documents

Publication Publication Date Title
WO2017202016A1 (zh) 语音唤醒方法和装置
CN108010515B (zh) 一种语音端点检测和唤醒方法及装置
US10719115B2 (en) Isolated word training and detection using generated phoneme concatenation models of audio inputs
US9437186B1 (en) Enhanced endpoint detection for speech recognition
US20180293974A1 (en) Spoken language understanding based on buffered keyword spotting and speech recognition
US10685647B2 (en) Speech recognition method and device
US11688392B2 (en) Freeze words
CN110060693A (zh) 模型训练方法、装置、电子设备及存储介质
JP7604656B2 (ja) 検出のシーケンスに基づいたホットフレーズトリガ
CN109955270B (zh) 语音选项选择系统与方法以及使用其的智能机器人
CN113611316A (zh) 人机交互方法、装置、设备以及存储介质
JP2020109475A (ja) 音声対話方法、装置、設備、及び記憶媒体
JP2024538771A (ja) デジタル信号プロセッサベースの継続的な会話
CN114299927A (zh) 唤醒词识别方法、装置、电子设备及存储介质
KR20150105847A (ko) 음성구간 검출 방법 및 장치
CN111063356B (zh) 电子设备响应方法及系统、音箱和计算机可读存储介质
US12080276B2 (en) Adapting automated speech recognition parameters based on hotword properties
CN112786047B (zh) 一种语音处理方法、装置、设备、存储介质及智能音箱
JP2024529888A (ja) 程度によるホットワード検出
WO2019242312A1 (zh) 家电设备的唤醒词训练方法、装置及家电设备
US12165641B2 (en) History-based ASR mistake corrections
CN119522453A (zh) 基于历史的asr错误修正
KR20250025701A (ko) 이력 기반 asr 실수 정정
CN113658593A (zh) 基于语音识别的唤醒实现方法及装置

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16903000

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 16903000

Country of ref document: EP

Kind code of ref document: A1