WO2020125038A1 - Voice control method and device - Google Patents
Voice control method and device
- Publication number
- WO2020125038A1 (PCT/CN2019/100982)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- voice
- threshold
- lip
- feature vectors
- user
- Prior art date
Classifications
- G10L15/22 — Speech recognition: procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/25 — Speech recognition using non-acoustical features: using position of the lips, movement of the lips or face analysis
- G10L15/26 — Speech recognition: speech to text systems
Definitions
- the invention relates to the field of voice recognition, and in particular to a voice control method and device.
- the embodiments of the present application provide a voice control method and device.
- a voice control method is provided, which includes: acquiring voice feature data of a user, and acquiring lip feature data corresponding to the voice feature data; determining a control word for controlling a terminal based on the voice feature data and the lip feature data; and controlling the terminal to perform the operation corresponding to the control word.
- a voice control device is provided, including: an acquisition module for acquiring the user's voice feature data and the lip feature data corresponding to the voice feature data; a determination module for determining the control word for controlling the terminal based on the voice feature data and the lip feature data; and a control module for controlling the terminal to perform the operation corresponding to the control word.
- a computer-readable storage medium stores a computer program, and the computer program is used to execute the foregoing voice control method.
- an electronic device including: a processor; a memory for storing processor-executable instructions, wherein the processor is used to execute the foregoing voice control method.
- Embodiments of the present application provide a voice control method and device which recognize control words in a user's voice by fusing voice feature data and lip feature data. In many situations, such as high noise, low light, and low sound energy, this improves the accuracy with which the user's voice is captured, thereby improving the accuracy of voice recognition, improving the user experience, and making human-computer interaction more natural.
- FIG. 1A is a schematic diagram of a system architecture of a voice control system provided by an exemplary embodiment of the present application.
- FIG. 1B is a schematic flowchart of a voice control method provided by an exemplary embodiment of the present application.
- FIG. 2 is a schematic flowchart of a voice control method provided by another exemplary embodiment of the present application.
- FIG. 3 is a schematic flowchart of a voice control method provided by another exemplary embodiment of the present application.
- FIG. 4 is a schematic flowchart of a voice control method provided by another exemplary embodiment of the present application.
- FIG. 5 is a schematic flowchart of a voice control method provided by another exemplary embodiment of the present application.
- FIG. 6 is a schematic structural diagram of a voice control device provided by an exemplary embodiment of the present application.
- FIG. 7 is a schematic structural diagram of a voice control device provided by another exemplary embodiment of the present application.
- FIG. 8 is a block diagram of an electronic device provided by an exemplary embodiment of the present application.
- the voice interaction process includes a voice wake-up process and a voice command recognition process.
- wind noise, horns, road noise, and other loud sounds are easily mixed into the signal, interfering with the voice interaction process.
- problems such as false wake-up, failure to wake up, and having to shout are prone to occur, resulting in a poor user experience.
- FIG. 1A is a schematic diagram of a system architecture of a voice control system 1 according to an exemplary embodiment of the present application, which shows an application scenario of waking up a terminal (for example, a vehicle-mounted device).
- the voice control system 1 includes an electronic device 10, a voice collection device 20 (for example, a microphone array), an image collection device 30 (for example, a camera), and a terminal 40.
- the voice collecting device 20 is used to collect user's voice.
- the image acquisition device 30 is used to acquire a video image containing the user's lip image.
- the electronic device 10 is used to receive voice signals and video image signals from the voice collection device 20 and the image collection device 30 respectively, perform control word (e.g., wake-up word) recognition on the voice signal and the video image signal, and control the terminal 40 to perform the corresponding operation according to the control word recognition result.
- voice collection device 20 and the image collection device 30 in this application may also be integrated on the electronic device 10.
- FIG. 1B is a schematic flowchart of a voice control method provided by an exemplary embodiment of the present application.
- the execution subject of this embodiment may be, for example, the electronic device in FIG. 1A, as shown in FIG. 1B, the method includes the following steps:
- Step 110 Obtain the voice characteristic data of the user.
- Step 120 Acquire lip feature data corresponding to the voice feature data.
- step 120 may be executed before step 110, or may be executed simultaneously with step 110.
- the technical solution of the present application is described in detail by taking an in-vehicle device as the terminal. The in-vehicle device may be a speaker, a display device, etc. in the in-vehicle system, and the execution subject (electronic device) of the method may be the controller of the in-vehicle device or the controller of the in-vehicle system.
- the vehicle-mounted system may further include a camera and a microphone, and the camera and the microphone may be installed at positions corresponding to the main driver in the car.
- the controller can collect the lip feature data of the user by controlling the camera, and at the same time collect the voice feature data of the user by controlling the microphone.
- the lip feature data and the voice feature data correspond. For example, if the user says "start”, then the lip feature data and the voice feature data both correspond to "start".
- the lip feature data may be an image representing changes in lip motion, or may be matrix or vector data extracted from the image to characterize the content of the lip language.
- the voice feature data may be a voice segment, or may be matrix or vector data extracted from the voice segment to characterize the voice content.
- Step 130 Determine the control word of the control terminal based on the voice feature data and the lip feature data.
- the control word can be a wake-up word or a voice command.
- the control word can be determined by comparing the voice feature data with preset voice feature data, simultaneously comparing the lip feature data with preset lip feature data, and combining the comparison results of the two.
- the control word may be determined by simultaneously inputting the voice feature data and the lip feature data into the control word recognition model.
- the control word recognition model may be obtained by training on multiple samples, where each sample contains sample voice feature data and sample lip feature data. During training, samples collected in different lighting environments or different noise environments can be used, so that the resulting control word recognition model handles these conditions.
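- To make the fused recognition step concrete, below is a minimal sketch (not the model disclosed in this application) of a wake-word classifier that consumes sequences of combined lip/acoustic feature vectors; the PyTorch framework, the layer sizes, and the name CombinedWakeWordModel are illustrative assumptions.

```python
# Hypothetical sketch of a control-word (wake-word) recognition model that
# consumes per-frame combined feature vectors. All sizes are illustrative.
import torch
import torch.nn as nn

class CombinedWakeWordModel(nn.Module):
    def __init__(self, lip_dim=40, acoustic_dim=39, hidden=128, num_classes=2):
        super().__init__()
        # One GRU over the sequence of concatenated lip + acoustic frames.
        self.rnn = nn.GRU(lip_dim + acoustic_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)  # e.g. {background, wake word}

    def forward(self, combined_frames):
        # combined_frames: (batch, time, lip_dim + acoustic_dim)
        _, last_hidden = self.rnn(combined_frames)
        return self.classifier(last_hidden[-1])  # logits per class

# Training would use paired voice/lip samples recorded under different
# lighting and noise conditions, as the description suggests.
model = CombinedWakeWordModel()
logits = model(torch.randn(8, 50, 79))  # 8 samples, 50 frames, 79-dim frames
```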
- Step 140 The control terminal performs an operation corresponding to the control word.
- the controller controls the terminal to perform the corresponding operation.
- if the control word is a wake-up word, the terminal is woken up; if the control word is a voice command, the terminal performs the operation corresponding to the command; if the control word includes both the wake-up word and the voice command, the terminal is woken up and then executes the operation corresponding to the command.
- for example, if the control word is the wake-up word "start", the speaker is awakened, that is, it enters the working state and starts playing music; if the control word is a voice command such as "decrease volume" or "turn down the volume", the volume of the speaker becomes smaller; if the control word is "start, decrease volume" or "start, turn down the volume", the speaker enters the working state and starts playing music at a lower volume setting than before.
- the degree of volume reduction can be set in advance according to usage.
- An embodiment of the present application provides a voice control method which recognizes control words in a user's voice by fusing voice feature data and lip feature data. Combining the auditory and visual channels yields multimodal information, which enhances control word recognition and makes up for the shortcomings of identifying control words from voice or lip data alone. This greatly improves the accuracy with which the user's voice is captured under conditions such as high noise, low light, and low sound energy, which in turn improves the accuracy of voice recognition, improves the user experience, and makes human-computer interaction more natural.
- the voice feature data may be a voice feature vector
- the voice feature vector may include FBank voice feature parameters or Mel-frequency cepstral coefficient (MFCC) voice feature parameters.
- step 120 includes: collecting continuous multi-frame images containing changes in the user's lip motion, and extracting lip feature data based on the continuous multi-frame images.
- a video of the changes in the user's face over a certain period of time may be recorded by a camera, continuous multi-frame images corresponding to the video within that period may be collected, and lip feature data may then be extracted based on the continuous multi-frame images.
- the continuous multi-frame images may be images containing the user's complete facial features, or images containing part of the facial features (lips).
- collecting continuous multi-frame images may consist of identifying the user's continuous multi-frame face images from multiple continuous video images using a machine vision recognition model, and then collecting, from these face images, the continuous multi-frame images containing the relevant part of the face (the lips).
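- As an illustration of the face-then-lip collection step, the sketch below uses an off-the-shelf OpenCV Haar cascade as the machine vision recognition model; the cascade choice and the fixed lower-third crop for the lip region are assumptions, not the method of this application.

```python
# Illustrative sketch: detect the face in each video frame and crop a rough
# lip region (lower third of the face box). Cascade and crop are assumptions.
import cv2

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_lip_region(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None  # no face found in this frame
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep the largest face
    return frame_bgr[y + 2 * h // 3 : y + h, x : x + w]  # lower third of the face box
```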
- step 120 may include: collecting continuous multi-frame images containing changes in the user's lip motion; for each frame image in the continuous multi-frame images, extracting a plurality of feature points describing the shape of the lips; and normalizing the coordinates of the multiple feature points of each frame image to obtain the lip feature data.
- lip feature data obtained through feature points can improve the recognition accuracy of lip reading.
- the convergence of data processing can be accelerated, thereby increasing the speed of lip recognition.
- each frame of image corresponds to a shape of a lip.
- the upper image is the lip image when the user is not speaking
- the lower image is the image when the user is speaking.
- Multiple feature points can be extracted around the inner and outer lip edges in each lip shape, and each feature point can be represented by coordinates, and the multiple feature points can represent the lip shape in the frame image.
- when the user is speaking, the head may swing, which can shift the origin of coordinates from frame to frame when the feature points are extracted. Therefore, the feature points corresponding to the multi-frame images can be normalized so that the coordinate origins of the feature points in each frame are consistent, improving the accuracy of the lip feature data and thereby the recognition rate of control words.
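- One possible way to normalize the feature-point coordinates per frame is sketched below; centering on the mouth centroid and scaling by the largest centroid distance are illustrative choices, assuming the landmarks are given as (x, y) pairs.

```python
# Sketch: per-frame normalization of lip feature points so every frame shares
# a common origin and scale, reducing the effect of head movement.
import numpy as np

def normalize_lip_points(points):
    """points: (N, 2) array of (x, y) lip landmark coordinates for one frame."""
    pts = np.asarray(points, dtype=np.float64)
    centered = pts - pts.mean(axis=0)                 # mouth centroid as origin
    scale = np.linalg.norm(centered, axis=1).max()    # largest distance from centroid
    return centered / (scale + 1e-8)                  # scale-invariant lip shape

def lip_feature_vector(points):
    # Flatten the normalized points into one lip feature vector per frame.
    return normalize_lip_points(points).ravel()
```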
- the control method further includes: determining whether the illuminance of the environment in which the user is located is greater than the first threshold, and determining whether the angle of the human face in each of the consecutive multi-frame images (the angle is zero when the face is directly facing the camera) is less than or equal to the second threshold; if the illuminance is greater than the first threshold and the angle is less than or equal to the second threshold, the extraction of lip feature data based on the continuous multi-frame images is performed.
- when the illuminance of the environment is lower than or equal to the first threshold, a light-emitting device can be controlled to supplement the light; it is then determined whether the angle of the face is less than or equal to the second threshold, and if so, the continuous multi-frame images containing changes in the user's lip motion are collected.
- the acquisition of lip feature data may be abandoned, and the control word is directly determined based on the voice feature data.
- the control method further includes: determining whether the energy of the user's sound is greater than the third threshold, and determining whether the duration of the sound is greater than the fourth threshold; if the energy of the sound is greater than the third threshold and the duration of the sound is greater than the fourth threshold, the acquisition of the user's voice feature data is performed.
- if the energy of the user's sound is too small, effective voice feature data cannot be obtained from the sound; that is, even if the lip feature data is obtained, it is difficult to determine the control word.
- if the duration of the user's sound is too short, the sound may not contain a complete word, for example only "ah", "oh", etc.
- obtaining voice feature data from such a sound would increase the calculation burden on the controller without yielding a control word. Therefore, only when the energy of the sound is greater than the third threshold and the duration of the sound is greater than the fourth threshold is the voice feature data acquired from the sound.
- the third threshold and the fourth threshold can be set according to the control word actually used.
- the acquisition of speech feature data may be abandoned, and the control word is directly determined based on the lip feature data.
- in one embodiment, the control word includes a wake-up word, the lip feature data includes a plurality of lip feature vectors, the voice feature data includes a plurality of acoustic feature vectors, and the control word recognition model is a wake-up word recognition model.
- step 130 includes: using multiple lip feature vectors and multiple acoustic feature vectors to obtain multiple combined feature vectors; using a wake-up word recognition model to perform wake-up word recognition on the multiple combined feature vectors to obtain wake-up words.
- the above speech feature vector may specifically be an acoustic feature vector.
- the acoustic feature vector may include a vector composed of MFCC (Mel-frequency cepstral coefficient) parameters, differential parameters (characterizing changes in the MFCC parameters), and frame energy, where the MFCC parameters characterize the static features of speech, the differential parameters characterize the dynamic features of speech, and the frame energy characterizes the energy features of speech.
- the acoustic feature vector is a 39-dimensional MFCC vector, including 12-dimensional MFCC parameters, 12-dimensional first-order differential parameters, 12-dimensional second-order differential parameters, and 3-dimensional frame energy.
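- A sketch of assembling such a 39-dimensional acoustic feature vector per frame is shown below using librosa; the frame and hop lengths and the use of RMS energy are assumptions, since the application does not name a library.

```python
# Sketch: per-frame 39-dimensional acoustic features = 12 MFCCs, 12 first-order
# deltas, 12 second-order deltas, plus frame energy and its two deltas.
import librosa
import numpy as np

def acoustic_feature_vectors(wav_path, sr=16000):
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, n_fft=400, hop_length=160)
    d1 = librosa.feature.delta(mfcc)              # first-order differential parameters
    d2 = librosa.feature.delta(mfcc, order=2)     # second-order differential parameters
    energy = librosa.feature.rms(y=y, frame_length=400, hop_length=160)
    e1 = librosa.feature.delta(energy)
    e2 = librosa.feature.delta(energy, order=2)
    feats = np.vstack([mfcc, d1, d2, energy, e1, e2])  # (39, n_frames)
    return feats.T                                      # one 39-dim vector per frame
```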
- Each image in a continuous multi-frame image corresponds to a lip feature vector
- the lip feature vector may be composed of multiple feature points.
- the voice collected by the microphone can also be divided into multiple frames according to time, and each frame of voice corresponds to a sound feature vector.
- Multiple lip feature vectors and multiple voice feature vectors can be in one-to-one correspondence, and multiple combined feature vectors can be obtained by combining the lip feature vector and the voice feature vector.
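- Given per-frame acoustic and lip vectors, the combination step can be as simple as aligning the two frame sequences and concatenating them, as in the sketch below; nearest-frame alignment is an assumption made here because audio and video frame rates usually differ.

```python
# Sketch: pair each acoustic frame with the temporally nearest lip frame and
# concatenate them into one combined feature vector per frame.
import numpy as np

def combine_features(acoustic, lip):
    """acoustic: (Ta, 39) array; lip: (Tv, D) array sampled at the video rate."""
    idx = np.linspace(0, len(lip) - 1, num=len(acoustic)).round().astype(int)
    lip_aligned = lip[idx]                     # nearest lip frame per audio frame
    return np.hstack([acoustic, lip_aligned])  # (Ta, 39 + D) combined vectors
```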
- FIG. 2 is a schematic flowchart of a voice control method provided by another exemplary embodiment of the present application.
- FIG. 2 is an example of FIG. 1B. In order to avoid repetition, the same parts are not explained in detail.
- the control method includes the following.
- Step 205 Use the microphone array to collect the user's voice.
- the microphone array may refer to a microphone containing multiple channels.
- Step 210 Acquire user's voice characteristic data.
- Step 215 Collect continuous multi-frame images containing changes in the user's lips.
- Step 220 If the illuminance is greater than the first threshold and the angle faced by the face is less than or equal to the second threshold, then step 225 is performed; otherwise, step 250 is performed.
- Step 225 Extract lip feature data based on continuous multi-frame images.
- Step 230 Determine the control word of the control terminal based on the voice characteristic data and the lip characteristic data.
- Step 240 The control terminal performs an operation corresponding to the control word.
- Step 250 If the illuminance is greater than the first threshold and the angle the face is facing is greater than the second threshold, then perform steps 255 and 260, otherwise end the entire process.
- Step 255 Determine the sound source location of the sound using the time difference of arrival (TDOA) method.
- Step 260 Adjust the camera angle of the camera used to shoot consecutive multi-frame images according to the position of the sound source.
- the location of the sound source of the sound can be determined by the TDOA method, and then the camera angle can be adjusted so that the camera can collect the frontal image of the user.
- Step 255 and step 215 can be performed at the same time, that is, the user's facial image is collected while adjusting the camera angle.
- the angle of the face facing the camera may be greater than the second threshold.
- adjusting the camera angle according to the position of the sound source can make the angle of the face relative to the camera at the next moment less than the second threshold, which can improve the accuracy of the lip feature data.
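- The TDOA idea can be illustrated for two microphones by estimating the inter-channel delay with cross-correlation and converting it to an arrival angle, as sketched below; a real microphone array would use more channels and a proper localization algorithm.

```python
# Simplified TDOA sketch for two microphones: estimate the inter-channel delay
# by cross-correlation and convert it to an arrival-angle estimate.
import numpy as np

SPEED_OF_SOUND = 343.0  # metres per second

def estimate_azimuth(mic1, mic2, fs, mic_distance):
    corr = np.correlate(mic1, mic2, mode="full")
    delay_samples = np.argmax(corr) - (len(mic2) - 1)  # sign follows numpy's convention
    tdoa = delay_samples / fs
    # Clamp to the physically possible range before taking the arcsine.
    ratio = np.clip(tdoa * SPEED_OF_SOUND / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(ratio))  # estimated angle of the sound source
```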
- FIG. 3 is a schematic flowchart of a voice control method provided by another exemplary embodiment of the present application.
- FIG. 3 is an example of FIG. 1B. In order to avoid repetition, the same parts are not explained in detail.
- the control method includes the following.
- Step 305 Use the microphone array to collect the user's voice.
- Step 310 Determine whether the energy of the user's voice is greater than the third threshold, and determine whether the duration of the voice is greater than the fourth threshold.
- if both conditions are met, step 315 is performed; otherwise, step 320 is performed.
- Step 315 Obtain the voice characteristic data of the user.
- Step 320 If the energy of the sound is less than or equal to the third threshold, or the duration of the sound is less than or equal to the fourth threshold, it is further determined whether the similarity between the lip feature data of two adjacent images in the continuous multi-frame images is below the fifth threshold.
- the controller may determine, in time order, whether the similarity between the lip feature data of two adjacent frames is lower than the fifth threshold. If it is, the user's lips are moving, and the user may be about to say a sentence containing a control word again, so step 305 is repeated within a preset time.
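- The adjacent-frame similarity check could, for instance, use cosine similarity between lip feature vectors, as in the sketch below; the threshold value is only a placeholder for the fifth threshold.

```python
# Sketch: decide whether the lips are moving by comparing the lip feature
# vectors of adjacent frames with cosine similarity.
import numpy as np

def lips_are_moving(lip_frames, fifth_threshold=0.95):
    """lip_frames: (T, D) lip feature vectors, one row per video frame."""
    for prev, curr in zip(lip_frames[:-1], lip_frames[1:]):
        cos = np.dot(prev, curr) / (np.linalg.norm(prev) * np.linalg.norm(curr) + 1e-8)
        if cos < fifth_threshold:
            return True   # adjacent frames differ enough: the lips moved
    return False
```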
- Step 325 Acquire continuous multi-frame images containing changes in the user's lips.
- Step 330 Extract lip feature data based on continuous multi-frame images.
- Step 330 may be performed before step 320.
- Step 340 Determine the control word of the control terminal based on the voice characteristic data and the lip characteristic data.
- Step 345 The control terminal performs the operation corresponding to the control word.
- FIG. 4 is a schematic flowchart of a voice control method provided by another exemplary embodiment of the present application.
- FIG. 4 is the example of FIG. 1B. In order to avoid repetition, the same parts are not explained in detail.
- the control method includes the following.
- Step 410 Acquire multiple acoustic feature vectors of the user.
- Step 420 Acquire multiple lip feature vectors corresponding to multiple acoustic feature vectors.
- the controller can reduce the weight of the acoustic feature vector and increase the weight of the lip feature vector, that is, by adjusting the weight values of the two vectors, a more accurate combined feature vector can be obtained.
- if the energy of the sound is less than the sixth threshold, or the noise level of the sound is greater than the seventh threshold, step 431 is executed.
- the sixth threshold may be greater than or equal to the third threshold in FIG. 3.
- Step 431 Determine a first weight value corresponding to multiple lip feature vectors and a second weight value corresponding to multiple acoustic feature vectors.
- the first weight value is greater than the second weight value.
- otherwise, step 431 may also be performed, in which case the first weight value may be equal to the second weight value.
- the distribution of weight values may be determined by a weight value distribution model, which may be obtained by performing deep learning on samples of different scenes (sound environment, illuminance, angle of face facing).
- Step 432 Use the first weight value and the second weight value to perform weighted calculation on the multiple lip feature vectors and the multiple acoustic feature vectors to obtain multiple combined feature vectors.
- the accuracy of the combined feature vector can be improved, thereby improving the accuracy of the wake word recognition.
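- Applying the weights before combining the two modalities could look like the sketch below; the weight values are placeholders, and in the description they would come from a weight-value distribution model.

```python
# Sketch: weight the two modalities before concatenation. In poor acoustic
# conditions the lip weight (first weight value) exceeds the acoustic weight.
import numpy as np

def weighted_combine(acoustic, lip_aligned, lip_weight=0.7, acoustic_weight=0.3):
    """Both inputs are (T, D) arrays already aligned frame by frame."""
    return np.hstack([acoustic_weight * acoustic, lip_weight * lip_aligned])
```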
- Step 433 Use a wake-up word recognition model to recognize wake-up words from multiple combined feature vectors to obtain a wake-up word.
- Step 440 Wake up the terminal.
- FIG. 5 is a schematic flowchart of a voice control method provided by another exemplary embodiment of the present application.
- FIG. 5 is the example of FIG. 1B. In order to avoid repetition, the same parts are not explained in detail.
- the control method includes the following.
- Step 510 Acquire multiple acoustic feature vectors of the user.
- Step 520 Acquire multiple lip feature vectors corresponding to multiple acoustic feature vectors.
- if the illuminance of the environment in which the user is located is less than the eighth threshold, step 531 is executed.
- Step 531 Determine a third weight value corresponding to multiple lip feature vectors and a fourth weight value corresponding to multiple acoustic feature vectors.
- the third weight value is less than the fourth weight value.
- otherwise, step 531 may also be executed, in which case the third weight value may be equal to the fourth weight value.
- Step 532 Use the third weight value and the fourth weight value to perform weighted calculation on the multiple lip feature vectors and the multiple acoustic feature vectors to obtain multiple combined feature vectors.
- the accuracy rate of the combined feature vector can be improved, thereby improving the accuracy rate of the wake word recognition.
- Step 533 Use a wake-up word recognition model to recognize wake-up words from multiple combined feature vectors to obtain a wake-up word.
- Step 540 Wake up the terminal.
- the weight values of the lip feature vector and the speech feature vector may also be redistributed according to the angle of the face in the image.
- FIG. 1 to FIG. 5 can complement each other to obtain a control word recognition process with more complete functions and higher accuracy.
- FIG. 6 is a schematic structural diagram of a voice control device 600 provided by an exemplary embodiment of the present application. As shown in FIG. 6, the device 600 includes an acquisition module 610, a determination module 620, and a control module 630.
- the obtaining module 610 is used to obtain the user's voice feature data and the lip feature data corresponding to the voice feature data; the determining module 620 is used to determine the control word for controlling the terminal based on the voice feature data and the lip feature data; and the control module 630 is used to control the terminal to perform the operation corresponding to the control word.
- An embodiment of the present application provides a voice control device which recognizes control words in a user's voice by fusing voice feature data and lip feature data. Fusing the auditory and visual channels yields multimodal information, which enhances control word recognition and makes up for the shortcomings of identifying control words from voice or lip data alone. This greatly improves the accuracy with which the user's voice is captured under conditions such as high noise, low light, and low sound energy, which in turn improves the accuracy of voice recognition, improves the user experience, and makes human-computer interaction more natural.
- the obtaining module 610 is used to collect continuous multi-frame images containing changes in the user's lip motion, and extract lip feature data based on the continuous multi-frame images.
- the acquisition module 610 is used to extract, for each frame image in the continuous multi-frame images, a plurality of feature points describing the shape of the lips, and to normalize the coordinates of the feature points of each frame image to obtain the lip feature data.
- the determination module 620 is further used to determine whether the illuminance of the environment where the user is located is greater than the first threshold, and determine whether the angle of the face in each frame of the continuous multi-frame images is less than or equal to the second Threshold.
- the acquisition module 610 extracts lip feature data based on consecutive multi-frame images.
- the determination module 620 is further used to determine whether the energy of the user's voice is greater than the third threshold, and determine whether the duration of the voice is greater than the fourth threshold.
- FIG. 7 is a schematic structural diagram of a voice control apparatus 700 provided by another exemplary embodiment of the present application.
- FIG. 7 is the example of FIG. 6, and the same points will not be repeated.
- the apparatus 700 includes an acquisition module 710, a determination module 720, a control module 730, and an acquisition module 740.
- for the specific functions of the acquisition module 710, the determination module 720, and the control module 730, refer to the description of FIG. 6.
- the collection module 740 is used to collect the user's voice using the microphone array.
- the determination module 720 is also used to determine the sound source position of the sound using the time difference of arrival method TDOA.
- the device 700 further includes: an adjustment module 750 for adjusting the camera angle of the camera used to capture consecutive multi-frame images according to the position of the sound source.
- the acquiring module 710 acquires the voice characteristic data of the user.
- the determination module 720 is further configured to determine whether the similarity between the lip feature data of two adjacent images in consecutive multi-frame images is lower than a fifth threshold.
- if the similarity is lower than the fifth threshold, and the energy of the sound is less than or equal to the third threshold or the duration of the sound is less than or equal to the fourth threshold, the collection module 740 repeatedly collects the user's sound within a preset time.
- the control word includes a wake-up word, the lip feature data includes multiple lip feature vectors, and the voice feature data includes multiple acoustic feature vectors; the determination module 720 is configured to obtain multiple combined feature vectors using the multiple lip feature vectors and the multiple acoustic feature vectors, and to perform wake-up word recognition on the multiple combined feature vectors using the wake-up word recognition model to obtain the wake-up word.
- the determining module 720 is further used to determine whether the energy of the user's voice is less than the sixth threshold, and determine whether the noise level of the voice is greater than the seventh threshold.
- the determination module 720 is used to determine a first weight value corresponding to the plurality of lip feature vectors and a second weight value corresponding to the plurality of acoustic feature vectors, where the first weight value is greater than the second weight value, and to perform weighted calculation on the plurality of lip feature vectors and the plurality of acoustic feature vectors using the first weight value and the second weight value to obtain the multiple combined feature vectors.
- the determination module 720 is further used to determine whether the illuminance of the environment where the user is located is less than the eighth threshold.
- the determination module 720 is used to determine a third weight value corresponding to the multiple lip feature vectors and a fourth weight value corresponding to the multiple acoustic feature vectors, where the third weight value is less than the fourth weight value, and to perform weighted calculation on the multiple lip feature vectors and the multiple acoustic feature vectors using the third weight value and the fourth weight value to obtain the multiple combined feature vectors.
- the acquisition module 710 is used to identify a user's continuous multi-frame face images from multiple continuous video images using a machine vision recognition model, and collect continuous multi-frame images from the continuous multi-frame face images.
- the electronic device 80 can perform the aforementioned voice control process.
- FIG. 8 illustrates a block diagram of an electronic device according to an embodiment of the present application.
- the electronic device 80 includes one or more processors 81 and memory 82.
- the processor 81 may be a central processing unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 80 to perform desired functions.
- the memory 82 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
- the volatile memory may include, for example, random access memory (RAM) and/or cache memory.
- the non-volatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, and the like.
- one or more computer program instructions may be stored on the computer-readable storage medium, and the processor 81 may execute the program instructions to implement the voice control method of the embodiments of the present application described above and/or other desired functions.
- Various contents such as voice signals, video image signals, etc. can also be stored in the computer-readable storage medium.
- the electronic device 80 may further include: an input device 83 and an output device 84, and these components are interconnected by a bus system and/or other forms of connection mechanisms (not shown).
- the input device 83 may be the aforementioned microphone array and camera, which are used to capture input signals of voice and video images, respectively.
- the input device 83 may be a communication network connector for receiving the collected input signals from the microphone array and the camera.
- the input device 83 may also include, for example, a keyboard, a mouse, and the like.
- the output device 84 can output various information to the outside, including the determined control word and the like.
- the output device 84 may include, for example, a display, a speaker, a printer, and a communication network and its connected remote output device.
- the electronic device 80 may also include any other suitable components.
- embodiments of the present application may also be computer program products, which include computer program instructions that, when executed by a processor, cause the processor to perform the steps in the voice control method according to various embodiments of the present application described in the "exemplary method" section of this specification.
- the computer program product may write the program code for performing the operations of the embodiments of the present application in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
- the program code may be executed entirely on the user's computing device, partly on the user's device as an independent software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
- an embodiment of the present application may also be a computer-readable storage medium having computer program instructions stored thereon, which, when executed by a processor, cause the processor to perform the steps in the voice control method according to various embodiments of the present application described in the "exemplary method" section of this specification.
- the computer-readable storage medium may employ any combination of one or more readable media.
- the readable medium may be a readable signal medium or a readable storage medium.
- the readable storage medium may include, but is not limited to, electrical, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
- each component or each step can be decomposed and/or recombined.
- decompositions and/or recombinations shall be regarded as equivalent solutions of this application.
Claims (13)
- 1. A voice control method, comprising: acquiring voice feature data of a user, and acquiring lip feature data corresponding to the voice feature data; determining a control word for controlling a terminal based on the voice feature data and the lip feature data; and controlling the terminal to perform an operation corresponding to the control word.
- 2. The method according to claim 1, wherein acquiring the lip feature data corresponding to the voice feature data comprises: collecting continuous multi-frame images containing changes in the user's lip motion; for each frame image in the continuous multi-frame images, extracting a plurality of feature points describing the shape of the lips; and normalizing the coordinates of the plurality of feature points of each frame image in the continuous multi-frame images to obtain the lip feature data.
- 3. The method according to claim 2, further comprising: determining whether the illuminance of the environment in which the user is located is greater than a first threshold, and determining whether the angle of the human face in each frame image of the continuous multi-frame images is less than or equal to a second threshold, wherein, if the illuminance is greater than the first threshold and the angle is less than or equal to the second threshold, the extracting of the lip feature data based on the continuous multi-frame images is performed.
- 4. The method according to claim 3, further comprising: collecting the user's sound using a microphone array; if the angle is greater than the second threshold, determining the sound source position of the sound using the time difference of arrival (TDOA) method; and adjusting the camera angle of the camera used to capture the continuous multi-frame images according to the sound source position.
- 5. The method according to claim 2, further comprising: determining whether the energy of the user's sound is greater than a third threshold, and determining whether the duration of the sound is greater than a fourth threshold, wherein, if the energy of the sound is greater than the third threshold and the duration of the sound is greater than the fourth threshold, the acquiring of the voice feature data of the user is performed.
- 6. The method according to claim 5, further comprising: determining whether the similarity between the lip feature data of two adjacent images in the continuous multi-frame images is lower than a fifth threshold; and if the similarity is lower than the fifth threshold, and the energy of the sound is less than or equal to the third threshold or the duration of the sound is less than or equal to the fourth threshold, repeatedly collecting the user's sound within a preset time.
- 7. The method according to claim 2, wherein the control word includes a wake-up word, the lip feature data includes a plurality of lip feature vectors, and the voice feature data includes a plurality of acoustic feature vectors, and wherein determining the control word for controlling the terminal based on the voice feature data and the lip feature data comprises: obtaining a plurality of combined feature vectors using the plurality of lip feature vectors and the plurality of acoustic feature vectors; and performing wake-up word recognition on the plurality of combined feature vectors using a wake-up word recognition model to obtain the wake-up word.
- 8. The method according to claim 7, further comprising: determining whether the energy of the user's sound is less than a sixth threshold, and determining whether the noise level of the sound is greater than a seventh threshold, wherein obtaining the plurality of combined feature vectors using the plurality of lip feature vectors and the plurality of acoustic feature vectors comprises: if the energy of the sound is less than the sixth threshold or the noise level of the sound is greater than the seventh threshold, determining a first weight value corresponding to the plurality of lip feature vectors and a second weight value corresponding to the plurality of acoustic feature vectors, wherein the first weight value is greater than the second weight value; and performing weighted calculation on the plurality of lip feature vectors and the plurality of acoustic feature vectors using the first weight value and the second weight value respectively to obtain the plurality of combined feature vectors.
- 9. The method according to claim 7, further comprising: determining whether the illuminance of the environment in which the user is located is less than an eighth threshold, wherein obtaining the plurality of combined feature vectors using the plurality of lip feature vectors and the plurality of acoustic feature vectors comprises: if the illuminance is less than the eighth threshold, determining a third weight value corresponding to the plurality of lip feature vectors and a fourth weight value corresponding to the plurality of acoustic feature vectors, wherein the third weight value is less than the fourth weight value; and performing weighted calculation on the plurality of lip feature vectors and the plurality of acoustic feature vectors using the third weight value and the fourth weight value respectively to obtain the plurality of combined feature vectors.
- 10. The method according to any one of claims 1 to 9, wherein acquiring the voice feature data of the user comprises: collecting the user's sound using a microphone array; and extracting continuous voice feature vectors from the sound, the voice feature vectors including FBank voice feature parameters or Mel-frequency cepstral coefficient (MFCC) voice feature parameters.
- 11. A voice control device, comprising: an acquisition module for acquiring voice feature data of a user and acquiring lip feature data corresponding to the voice feature data; a determination module for determining a control word for controlling a terminal based on the voice feature data and the lip feature data; and a control module for controlling the terminal to perform an operation corresponding to the control word.
- 12. A computer-readable storage medium storing a computer program, the computer program being used to execute the voice control method according to any one of claims 1 to 10.
- 13. An electronic device, comprising: a processor; and a memory for storing instructions executable by the processor, wherein the processor is configured to execute the voice control method according to any one of claims 1 to 10.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811543052.8 | 2018-12-17 | ||
CN201811543052.8A CN111326152A (zh) | 2018-12-17 | 2018-12-17 | Voice control method and device
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020125038A1 true WO2020125038A1 (zh) | 2020-06-25 |
Family
ID=71100644
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/100982 WO2020125038A1 (zh) | 2018-12-17 | 2019-08-16 | Voice control method and device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111326152A (zh) |
WO (1) | WO2020125038A1 (zh) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113436618A (zh) * | 2020-08-22 | 2021-09-24 | 彭玲玲 | Signal accuracy adjustment system for capturing voice commands |
CN113689858B (zh) * | 2021-08-20 | 2024-01-05 | 广东美的厨房电器制造有限公司 | Control method and apparatus for cooking appliance, electronic device, and storage medium |
CN113723528B (zh) * | 2021-09-01 | 2023-12-29 | 斑马网络技术有限公司 | In-vehicle speech-vision fusion multimodal interaction method, system, device, and storage medium |
CN114093354A (zh) * | 2021-10-26 | 2022-02-25 | 惠州市德赛西威智能交通技术研究院有限公司 | Method and system for improving the recognition accuracy of an in-vehicle voice assistant |
CN117672228B (zh) * | 2023-12-06 | 2024-06-25 | 江苏中科重德智能科技有限公司 | Machine-learning-based intelligent voice interaction false wake-up system and method |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102023703A (zh) * | 2009-09-22 | 2011-04-20 | 现代自动车株式会社 | Multimodal interface system combining lip reading and speech recognition |
CN102314595A (zh) * | 2010-06-17 | 2012-01-11 | 微软公司 | RGB/depth camera for improving speech recognition |
US20130054240A1 (en) * | 2011-08-25 | 2013-02-28 | Samsung Electronics Co., Ltd. | Apparatus and method for recognizing voice by using lip image |
CN103218924A (zh) * | 2013-03-29 | 2013-07-24 | 上海众实科技发展有限公司 | Spoken language learning monitoring method based on audio-video bimodality |
CN105096935A (zh) * | 2014-05-06 | 2015-11-25 | 阿里巴巴集团控股有限公司 | Voice input method, device and system |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2004024863A (ja) * | 1994-05-13 | 2004-01-29 | Matsushita Electric Ind Co Ltd | Lip recognition device and utterance interval recognition device |
CN102117115B (zh) * | 2009-12-31 | 2016-11-23 | 上海量科电子科技有限公司 | System and implementation method for text input selection using lip language |
JP5323770B2 (ja) * | 2010-06-30 | 2013-10-23 | 日本放送協会 | User instruction acquisition device, user instruction acquisition program, and television receiver |
CN102004549B (zh) * | 2010-11-22 | 2012-05-09 | 北京理工大学 | Automatic lip-reading recognition system suitable for Chinese |
CN105389097A (zh) * | 2014-09-03 | 2016-03-09 | 中兴通讯股份有限公司 | Human-computer interaction device and method |
CN107799125A (zh) * | 2017-11-09 | 2018-03-13 | 维沃移动通信有限公司 | Voice recognition method, mobile terminal, and computer-readable storage medium |
CN108346427A (zh) * | 2018-02-05 | 2018-07-31 | 广东小天才科技有限公司 | Voice recognition method, apparatus, device, and storage medium |
CN108446675A (zh) * | 2018-04-28 | 2018-08-24 | 北京京东金融科技控股有限公司 | Facial image recognition method and apparatus, electronic device, and computer-readable medium |
- 2018-12-17: CN application CN201811543052.8A filed; published as CN111326152A (status: pending)
- 2019-08-16: PCT application PCT/CN2019/100982 filed; published as WO2020125038A1 (application filing)
Also Published As
Publication number | Publication date |
---|---|
CN111326152A (zh) | 2020-06-23 |
Legal Events
Code | Title | Details
---|---|---
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19898488; Country of ref document: EP; Kind code of ref document: A1
NENP | Non-entry into the national phase | Ref country code: DE
122 | Ep: pct application non-entry in european phase | Ref document number: 19898488; Country of ref document: EP; Kind code of ref document: A1
32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 09.02.2022)