WO2016150001A1 - Method and apparatus for speech recognition, and computer storage medium - Google Patents
Method and apparatus for speech recognition, and computer storage medium
- Publication number
- WO2016150001A1 (PCT application PCT/CN2015/079317)
- Authority
- WO
- WIPO (PCT)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/24—Speech recognition using non-acoustical features
Definitions
- the present invention relates to the field of communications, and in particular to a method and apparatus for voice recognition and a computer storage medium.
- Speech recognition technology has been applied in more and more fields as computers and related software and hardware have developed, and its recognition rate keeps improving. Under specific conditions such as a quiet environment and standard pronunciation, text-input speech recognition systems now reach recognition rates above 95%.
- Conventional speech recognition technology is relatively mature. For speech recognition on mobile terminals, however, the speech quality is relatively poor compared with ordinary speech recognition scenarios, so the recognition effect is limited.
- Poor voice quality here has several causes: background noise at the client, noise from the client's voice collection device or the call equipment, noise and interference on the communication line, an accent or the use of a dialect, and mumbled or unclear speech from the speaker. All of these factors can degrade speech recognition.
- The recognition rate is thus affected by many factors, and no effective solution has yet been proposed for the poor user experience caused by the low speech recognition rate in the related art. In a car, or where noise is heavy and pronunciation is non-standard, the recognition rate drops sharply, to the point of defeating any practical purpose; the low rate of correct recognition impairs precise control, and the results fall short. If other methods could assist the judgment and raise the accuracy of speech recognition, its practicality would improve markedly.
- Human language cognition is a multi-channel perception process. In everyday communication, people perceive the content of others' speech through sound, but in a noisy environment, or when the other party's pronunciation is ambiguous, it is also necessary to observe changes in mouth shape and facial expression to understand accurately what is being said. Current speech recognition systems ignore this visual aspect of language perception and rely on a single auditory channel, so their recognition rate drops significantly in noisy or multi-talker conditions, which reduces the practicality of speech recognition and limits its scope of application.
- Embodiments of the present invention provide a method, an apparatus, and a computer storage medium for speech recognition, so as to at least solve the problem in the related art that obtaining the user's speech content from the user's voice alone leads to low speech recognition accuracy.
- According to one aspect of the embodiments of the present invention, a method for voice recognition is provided, including: acquiring voice recognition information of a user's current voice, and acquiring auxiliary identification information for the voice recognition information based on the user's current state corresponding to the current voice; and determining a final recognition result of the user's current voice according to the voice recognition information and the auxiliary identification information.
- Further, determining the final recognition result of the user's current voice according to the voice recognition information and the auxiliary identification information includes: acquiring one or more first candidate vocabularies corresponding to the user's current voice according to the voice recognition information; acquiring a vocabulary category or one or more second candidate vocabularies corresponding to the user's current voice according to the auxiliary identification information; and determining the final recognition result of the user's current voice according to the one or more first candidate vocabularies and the vocabulary category, or according to the one or more first candidate vocabularies and the one or more second candidate vocabularies.
- Further, determining the final recognition result of the user's current voice according to the one or more first candidate vocabularies and the vocabulary category includes: selecting, from the one or more first candidate vocabularies, a first specific vocabulary that matches the vocabulary category, and using the first specific vocabulary as the final recognition result of the user's current voice.
- Further, determining the final recognition result of the user's current voice according to the one or more first candidate vocabularies and the one or more second candidate vocabularies includes: selecting, from the one or more second candidate vocabularies, a second specific vocabulary with high similarity to the one or more first candidate vocabularies, and using the second specific vocabulary as the final recognition result of the user's current voice.
- Further, acquiring the auxiliary identification information of the voice recognition information based on the user's current state corresponding to the current voice includes: acquiring an image indicating the user's current state; acquiring image feature information from the image; and acquiring, according to the image feature information, a vocabulary category and/or one or more candidate vocabularies corresponding to that information, which are used as the auxiliary identification information.
- Further, acquiring the vocabulary category and/or the one or more candidate vocabularies corresponding to the image feature information includes: searching a predetermined image library for a specific image with the highest similarity to the image feature information, and acquiring the vocabulary category or one or more candidate vocabularies corresponding to that specific image according to a preset correspondence between images and vocabulary categories or candidate vocabularies.
- the current state of the user includes at least one of the following: a lip motion state of the user, a throat vibration state of the user, a facial motion state of the user, and a gesture motion state of the user.
- According to another aspect of the embodiments of the present invention, a device for voice recognition is provided, comprising: an obtaining module, configured to acquire voice recognition information of a user's current voice and to acquire auxiliary identification information for the voice recognition information based on the user's current state corresponding to the current voice; and a determining module, configured to determine a final recognition result of the user's current voice according to the voice recognition information and the auxiliary identification information.
- Further, the determining module includes: a first acquiring unit, configured to acquire one or more first candidate vocabularies corresponding to the user's current voice according to the voice recognition information; a second acquiring unit, configured to acquire a vocabulary category or one or more second candidate vocabularies corresponding to the user's current voice according to the auxiliary identification information; and a determining unit, configured to determine the final recognition result of the user's current voice according to the one or more first candidate vocabularies and the vocabulary category, or according to the one or more first candidate vocabularies and the one or more second candidate vocabularies.
- Further, the determining unit is further configured to select, from the one or more first candidate vocabularies, a first specific vocabulary that matches the vocabulary category, and to use the first specific vocabulary as the final recognition result of the user's current voice.
- Further, the determining unit is further configured to select, from the one or more second candidate vocabularies, a second specific vocabulary with high similarity to the one or more first candidate vocabularies, and to use the second specific vocabulary as the final recognition result of the user's current voice.
- Further, the obtaining module further includes: a third acquiring unit, configured to acquire an image indicating the user's current state; a fourth acquiring unit, configured to acquire image feature information from the image; and a fifth acquiring unit, configured to acquire a vocabulary category and/or one or more candidate vocabularies corresponding to the image feature information according to that information, and to use the vocabulary category and/or the one or more candidate vocabularies as the auxiliary identification information.
- Further, the fifth acquiring unit further includes: a searching subunit, configured to search a predetermined image library for a specific image with the highest similarity to the image feature information; and an acquiring subunit, configured to acquire the vocabulary category or one or more candidate vocabularies corresponding to the specific image according to a preset correspondence between images and vocabulary categories or candidate vocabularies.
- the current state of the user includes at least one of the following: a lip motion state of the user, a throat vibration state of the user, a facial motion state of the user, and a gesture motion state of the user.
- Further, the apparatus further includes: a judging module, configured to determine that the accuracy of a final recognition result of the user's current voice determined from the voice recognition information alone is less than a predetermined threshold.
- According to another aspect of the embodiments of the present invention, a terminal is provided, including a processor configured to acquire voice recognition information of a user's current voice, to acquire auxiliary identification information for the voice recognition information based on the user's current state corresponding to the current voice, and to determine a final recognition result of the user's current voice according to the voice recognition information and the auxiliary identification information.
- a computer storage medium having stored therein computer executable instructions for use in the above method of speech recognition.
- Through the embodiments of the present invention, voice recognition information of the user's current voice is acquired, auxiliary identification information for that voice recognition information is acquired based on the user's current state corresponding to the current voice, and a final recognition result of the user's current voice is determined according to the voice recognition information and the auxiliary identification information. This solves the problem that obtaining the user's speech content from the user's voice alone leads to low recognition accuracy, thereby improving the accuracy of speech recognition.
- FIG. 1 is a flow chart of a voice recognition method according to an embodiment of the present invention.
- FIG. 2 is a block diagram showing the structure of a voice recognition apparatus according to an embodiment of the present invention.
- FIG. 3 is a structural block diagram (1) of a voice recognition apparatus according to an embodiment of the present invention.
- FIG. 4 is a structural block diagram (2) of a voice recognition apparatus according to an embodiment of the present invention.
- FIG. 5 is a structural block diagram (3) of a voice recognition apparatus according to an embodiment of the present invention.
- FIG. 6 is a structural block diagram (4) of a voice recognition apparatus according to an embodiment of the present invention.
- FIG. 7 is a flowchart of a voice recognition processing method according to an embodiment of the present invention.
- FIG. 8 is a structural block diagram of a speech recognition processing apparatus according to an embodiment of the present invention.
- FIG. 9 is a flow chart of a speech recognition process in accordance with an embodiment of the present invention.
- FIG. 1 is a flowchart of a voice recognition method according to an embodiment of the present invention. As shown in FIG. 1, the process includes the following steps:
- Step S102: acquire voice recognition information of the user's current voice, and acquire auxiliary identification information for the voice recognition information based on the user's current state corresponding to the current voice;
- Step S104: determine a final recognition result of the user's current voice according to the voice recognition information and the auxiliary identification information.
- Through the above steps, the voice recognition information of the user's current voice is obtained, state feature information describing the user at the moment of speaking is also obtained, and that state feature information is used as auxiliary information for recognizing the current voice. Compared with the prior art, in which recognition from the user's current voice alone yields low accuracy, the above steps solve the problem that obtaining the user's speech content only from the voice leads to low recognition accuracy, thereby improving the accuracy of voice recognition.
- Step S104 above involves determining the final recognition result of the user's current voice according to the voice recognition information and the auxiliary identification information. In an optional embodiment, one or more first candidate vocabularies corresponding to the user's current voice are acquired according to the voice recognition information; a vocabulary category or one or more second candidate vocabularies corresponding to the current voice are acquired according to the auxiliary identification information; and the final recognition result is determined either according to the one or more first candidate vocabularies and the vocabulary category, or according to the one or more first candidate vocabularies and the one or more second candidate vocabularies.
- There are many ways to determine the final recognition result of the user's current voice from the one or more first candidate vocabularies and the vocabulary category. In one optional embodiment, a first specific vocabulary that matches the vocabulary category is selected from the one or more first candidate vocabularies and used as the final recognition result of the user's current voice. In another optional embodiment, a second specific vocabulary with high similarity to the one or more first candidate vocabularies is selected from the one or more second candidate vocabularies and used as the final recognition result. A sketch of both strategies follows.
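- As a concrete illustration of these two selection strategies, the minimal Python sketch below filters the first candidates by the vocabulary category supplied by the auxiliary channel, or re-ranks the second candidates by similarity to the first. All names, data layouts, and the fallback behavior are illustrative assumptions; the patent does not prescribe an implementation.

```python
# Minimal sketch of the two selection strategies described above.
# All names and data layouts are illustrative assumptions.

def select_by_category(first_candidates, category, lexicon_categories):
    """Return the first candidate whose preset lexicon category matches
    the vocabulary category derived from the auxiliary information."""
    for word in first_candidates:
        if category in lexicon_categories.get(word, set()):
            return word                      # the "first specific vocabulary"
    return first_candidates[0] if first_candidates else None

def select_by_similarity(first_candidates, second_candidates, similarity):
    """Return the second candidate most similar to any first candidate."""
    return max(second_candidates,
               key=lambda w2: max(similarity(w1, w2) for w1 in first_candidates))
```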
- In the process of acquiring the auxiliary identification information, in an optional embodiment, an image indicating the user's current state is obtained first; image feature information is then acquired from the image; and a vocabulary category and/or one or more candidate vocabularies corresponding to the image feature information are acquired according to that information and used as the auxiliary identification information.
- In an optional embodiment, a specific image with the highest similarity to the image feature information is searched for in a predetermined image library, and the vocabulary category or one or more candidate vocabularies corresponding to that specific image are acquired according to a preset correspondence between images and vocabulary categories or candidate vocabularies. The vocabulary category and/or candidate vocabularies corresponding to the image feature information are thereby obtained; a minimal sketch of this lookup follows.
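- The sketch below assumes each template image has been reduced to a fixed-length feature vector and stored alongside its preset vocabulary category and candidate words; the feature values and labels are invented placeholders, and cosine similarity stands in for whatever similarity measure an implementation would actually use.

```python
import numpy as np

# Hypothetical template library: a feature vector plus the vocabulary
# category / candidate vocabularies preset for that image.
image_library = [
    {"features": np.array([0.12, 0.80, 0.33]),
     "category": "contacts",
     "candidates": ["business card holder", "call"]},
]

def lookup_auxiliary_info(image_features):
    """Find the library image most similar to the observed feature
    vector and return its preset category and candidate vocabularies."""
    def cosine(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    best = max(image_library,
               key=lambda entry: cosine(entry["features"], image_features))
    return best["category"], best["candidates"]
```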
- The user's current state can take several forms, examples of which follow. In an optional embodiment, it includes at least one of the user's lip motion state, the user's throat vibration state, the user's facial motion state, and the user's gesture motion state. The information covered by the user's current state is listed here only by way of example and is not limited thereto.
- For example, in real life the content spoken by a speaker can be recognized from lip reading alone. Lip language is therefore an important auxiliary factor in recognizing speech.
- In an optional embodiment, before the voice recognition information of the user's current voice is acquired and the auxiliary identification information is acquired based on the user's current state corresponding to that voice, it is determined that the accuracy of a final recognition result determined from the voice recognition information alone is less than a predetermined threshold.
- In this embodiment, a device for voice recognition is also provided. The device is configured to implement the above embodiments and preferred implementations; what has already been described is not repeated here.
- As used below, the term "module" may refer to a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
- FIG. 2 is a structural block diagram of a voice recognition apparatus according to an embodiment of the present invention. As shown in FIG. 2, the apparatus includes: an obtaining module 22, configured to acquire voice recognition information of a user's current voice and to acquire auxiliary identification information for the voice recognition information based on the user's current state corresponding to that voice; and a determining module 24, configured to determine a final recognition result of the user's current voice according to the voice recognition information and the auxiliary identification information.
- FIG. 3 is a structural block diagram (1) of a voice recognition apparatus according to an embodiment of the present invention. As shown in FIG. 3, the determining module 24 includes: a first acquiring unit 242, configured to acquire one or more first candidate vocabularies corresponding to the user's current voice according to the voice recognition information; a second acquiring unit 244, configured to acquire a vocabulary category or one or more second candidate vocabularies corresponding to the user's current voice according to the auxiliary identification information; and a determining unit 246, configured to determine the final recognition result of the user's current voice according to the one or more first candidate vocabularies and the vocabulary category, or according to the one or more first candidate vocabularies and the one or more second candidate vocabularies.
- Optionally, the determining unit 246 is further configured to select, from the one or more first candidate vocabularies, a first specific vocabulary that matches the vocabulary category, and to use the first specific vocabulary as the final recognition result of the user's current voice.
- Optionally, the determining unit 246 is further configured to select, from the one or more second candidate vocabularies, a second specific vocabulary with high similarity to the one or more first candidate vocabularies, and to use the second specific vocabulary as the final recognition result of the user's current voice.
- FIG. 4 is a structural block diagram (2) of a voice recognition apparatus according to an embodiment of the present invention. As shown in FIG. 4, the obtaining module 22 further includes: a third acquiring unit 222, configured to acquire an image indicating the user's current state; a fourth acquiring unit 224, configured to acquire image feature information from the image; and a fifth acquiring unit 226, configured to acquire a vocabulary category and/or one or more candidate vocabularies corresponding to the image feature information according to that information, and to use the vocabulary category and/or the one or more candidate vocabularies as the auxiliary identification information.
- FIG. 5 is a structural block diagram (3) of a voice recognition apparatus according to an embodiment of the present invention. As shown in FIG. 5, the fifth acquiring unit 226 further includes: a searching subunit 2262, configured to search a predetermined image library for a specific image with the highest similarity to the image feature information; and an acquiring subunit 2264, configured to acquire the vocabulary category or one or more candidate vocabularies corresponding to the specific image according to a preset correspondence between images and vocabulary categories or candidate vocabularies.
- the current state of the user includes at least one of the following: a lip motion state of the user, a throat vibration state of the user, a facial motion state of the user, and a gesture motion state of the user.
- FIG. 6 is a structural block diagram (4) of a voice recognition apparatus according to an embodiment of the present invention. As shown in FIG. 6, the apparatus further includes: a judging module 26, configured to determine that the accuracy of a final recognition result of the user's current voice determined from the voice recognition information alone is less than a predetermined threshold.
- According to another aspect of the embodiments of the present invention, a terminal is also provided, including a processor configured to acquire voice recognition information of a user's current voice, to acquire auxiliary identification information for the voice recognition information based on the user's current state corresponding to the current voice, and to determine a final recognition result of the user's current voice according to the voice recognition information and the auxiliary identification information.
- It should be noted that each of the above modules may be implemented in software or in hardware. For the latter, this may be achieved in, but is not limited to, the following ways: the above modules are all located in the same processor, or the above modules are located in a first processor, a second processor, a third processor, and so on, respectively.
- In view of the above problems in the related art, the following description combines specific optional embodiments, which incorporate the above optional embodiments and their optional implementations. This optional embodiment provides a speech recognition processing method and apparatus to solve the problem of poor user experience caused by the low speech recognition rate in the related art. To overcome the above shortcomings of the prior art, an object of this optional embodiment is to provide an intelligent speech recognition method and apparatus based on auxiliary interaction modes: on the basis of speech recognition, which serves as the basic signal, lip recognition, face recognition, gesture recognition, throat vibration recognition, and the like are combined as auxiliary signals.
- Each technology's advantages in its application field are exploited so that the technologies complement one another; the technical modules are relatively independent yet mutually integrated, which greatly improves the speech recognition rate. Optionally, whether auxiliary signal recognition is added can be decided by the speech recognition result: when the likelihood of the speech recognition result is below a threshold, the auxiliary data is brought in. This matches the human language cognition process, a multi-channel perception process: the terminal perceives the content of speech through sound and, by also recognizing the speaker's mouth shape, facial changes, and the like, understands the spoken content accurately.
- According to one aspect of this optional embodiment, a voice recognition processing method is provided. Audio data acquired by an audio sensor serves as the basic signal for voice recognition, while moving images of the human body, covering gesture movement, facial movement, throat vibration, lip shape, and the like, are collected by the terminal device's camera or external sensors and analyzed by an integrated image algorithm and motion processing chip to serve as auxiliary signals for speech recognition. The recognition results of the basic signal and the auxiliary signals are processed together by the terminal, which then performs the corresponding operation: the auxiliary signal recognition results are accumulated with the basic speech recognition result to form a unified recognition result, which assists speech recognition and improves the recognition rate.
- The logical judgment sequence, in which speech recognition is first used as the basic signal for analysis and confirmation and the auxiliary signals are then used for auxiliary judgment, effectively reduces the probability of recognition errors caused by noise and external sound interference.
- Feature data is collected by the sensors and the camera, features are extracted, and a series of matching, judgment, and recognition steps is performed against preset template library data; the results are then compared with the corresponding recognition features to identify possible candidate vocabularies in the speech recognition model lexicon.
- Lip shape recognition captures an image of the speaker's lips through the camera, performs image processing on it, dynamically extracts lip features in real time, and then determines the speech content using a lip pattern recognition algorithm. The combination of lip shape and lip color is used to position the lips accurately, and identification is performed with an appropriate lip matching algorithm.
- Lip shape recognition extracts lip image features from the preprocessed video data and uses them to identify the current user's mouth shape changes; detecting the user's mouth motion in this way realizes lip shape recognition and improves recognition efficiency and accuracy. The mouth motion feature maps are classified to obtain classification information, and the mouth motion feature maps of each feature type correspond to a plurality of vocabulary categories.
- The lip shape recognition information undergoes a series of processing steps such as denoising and analog-to-digital (A/D) conversion and is compared against the template library data preset in the image/speech recognition processing module: the similarity between the lip identification information and all pre-sampled mouth motion feature maps is computed, and the vocabulary categories corresponding to the mouth motion feature map with the highest similarity are read out.
- Throat vibration recognition uses an external sensor to collect the vibration shape of the speaker's throat, processes that shape, dynamically extracts vibration shape features in real time, and then determines the speech content using a vibration shape pattern recognition algorithm.
- The user's throat vibration motion feature maps need to be sampled in advance, and separate throat vibration feature profiles are established for different users. Sampling may cover the throat vibration motion maps of individual syllables uttered by the user as well as the user's throat vibration motion maps more broadly. Because throat vibration differs between utterances while successive speech events from the same user are related, after throat vibration identification is completed the identified vibration is verified using a context correction technique, reducing recognition errors between throat vibration motion maps of the same type and further improving the accuracy of throat vibration recognition.
- Throat vibration recognition extracts throat vibration image features from the preprocessed vibration data and uses them to identify the current user's throat vibration changes; detecting the user's throat vibration motion in this way realizes throat vibration recognition and improves recognition efficiency and accuracy. The throat vibration motion feature maps are classified to obtain classification information, and the throat vibration motion features of each feature type correspond to a plurality of vocabulary categories.
- The acquired throat vibration information is compared against the template library data preset in the image/speech recognition processing module: the similarity between the throat vibration identification information and all pre-sampled throat vibration motion maps is computed, and the vocabulary categories corresponding to the throat vibration motion map with the highest similarity are read out.
- Face recognition is used to extract the user's facial features from the video data and to determine the user's identity and position. The facial muscles also follow different motion patterns when speaking, and by collecting the facial muscle signal characteristics, the corresponding muscle motion pattern can be identified to assist the recognition of the voice information.
- According to another aspect of this optional embodiment, a speech recognition processing apparatus is provided, comprising a basic signal module, an auxiliary signal module, and a signal processing module.
- the basic signal module is a traditional speech recognition module.
- The speech recognition module is configured to recognize the pre-processed audio data acquired through the audio sensor; its recognition objects include isolated-vocabulary speech recognition and continuous large-vocabulary speech recognition.
- the former is mainly used to determine control commands, and the latter is mainly used for text input.
- the identification of isolated vocabulary is mainly taken as an example, and the recognition of continuous large vocabulary uses the same processing method.
- Optionally, the audio sensor is a microphone array or a directional microphone. Because of the various forms of noise interference in the environment, existing audio acquisition based on an ordinary microphone is equally sensitive to the user's speech and to environmental noise and cannot distinguish between them, so the rate at which the user's speech recognition instructions are executed correctly tends to fall. Using a microphone array or a directional microphone overcomes this problem: sound source localization and speech enhancement algorithms track the user's voice and enhance its sound signal, suppressing the influence of ambient noise and other voices, raising the signal-to-noise ratio of the system's voice input, and ensuring that the back-end algorithms receive data of reliable quality.
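- As one plausible form of the speech enhancement referred to here, the sketch below implements a basic delay-and-sum beamformer: the channels of a microphone array are time-aligned toward the estimated direction of the user's voice and averaged, so coherent speech adds up while diffuse noise averages out. This is a simplified stand-in for the localization and enhancement algorithms the text mentions, not the patent's specified method.

```python
import numpy as np

def delay_and_sum(channels, delays_samples):
    """channels: (num_mics, num_samples) array of microphone signals.
    delays_samples: per-microphone integer delays (in samples) that
    align the user's voice across channels, e.g. from source localization.
    Returns the enhanced single-channel signal."""
    aligned = [np.roll(ch, -d) for ch, d in zip(channels, delays_samples)]
    return np.mean(aligned, axis=0)   # speech adds coherently, noise does not
```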
- The auxiliary signal module includes a front-end camera, an audio sensor, and a throat vibration sensor, and is configured to acquire video data, audio data, and motion data.
- Optionally, the throat vibration sensor is integrated into a wearable device and positioned in contact with the user's throat to detect the voice vibrations the user generates.
- One temperature sensor is placed inside the wearable device, and one temperature sensor is placed on the outside of the wearable device.
- The processor determines whether the wearable device is being worn by comparing the temperatures detected by the two sensors; when the device is not worn, it automatically enters sleep mode, reducing its overall power consumption. A minimal sketch of this check follows.
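- The wear-detection logic reduces to a temperature comparison; in the sketch below, the 2 °C margin is an assumed value, since the patent specifies no number.

```python
SKIN_CONTACT_MARGIN_C = 2.0   # assumed margin; not specified in the patent

def is_worn(inner_temp_c, outer_temp_c):
    """The inner sensor reads noticeably warmer than the outer one
    while the device rests against the user's throat."""
    return (inner_temp_c - outer_temp_c) > SKIN_CONTACT_MARGIN_C

def power_state(inner_temp_c, outer_temp_c):
    """Enter sleep mode automatically when the device is not worn."""
    return "active" if is_worn(inner_temp_c, outer_temp_c) else "sleep"
```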
- The microprocessor monitors the vibration sensor state, recognizes the voice command issued by the user, and sends the command via Bluetooth to the device to be controlled, which executes the recognized voice command.
- The signal processing module comprises a lip recognition module, a face recognition module, a vibration recognition module, a gesture recognition module, a voice recognition module, and a score adjustment module. It is configured to recognize the basic signal (the speech signal) and the auxiliary signals, with the basic signal selected as the main voice information and the auxiliary signals used as auxiliary voice information. Following the logical sequence in which the basic signal (the speech signal) is first analyzed and confirmed and the auxiliary signals are then used for auxiliary judgment, several words with the highest probability scores obtained from speech signal recognition are selected as candidate words; the auxiliary voice information generated by the auxiliary signals is used to raise the scores of the matching words among the candidate words and their related word sets in the speech recognition model lexicon; and finally the candidate word or related word with the highest score is selected as the recognition result.
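- The score adjustment described above amounts to re-ranking the speech recognizer's n-best list with category evidence from the auxiliary channels. A minimal sketch follows, assuming candidates arrive as word-to-score mappings and each auxiliary signal contributes one recognized vocabulary category; the boost value is an assumption, not a figure from the patent.

```python
CATEGORY_BOOST = 0.1   # assumed increment per matching auxiliary signal

def adjust_scores(candidates, lexicon_categories, auxiliary_categories):
    """candidates: dict mapping word -> probability score from speech
    recognition. Raise the score of each candidate whose lexicon
    category matches a category recognized by an auxiliary signal,
    then return the best word and the adjusted scores."""
    adjusted = dict(candidates)
    for category in auxiliary_categories:
        for word in adjusted:
            if category in lexicon_categories.get(word, set()):
                adjusted[word] += CATEGORY_BOOST
    return max(adjusted, key=adjusted.get), adjusted
```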
- the lip recognition module is configured to extract the feature of the lip image from the preprocessed video data, and use the lip information to identify the current user's mouth shape change;
- The face recognition module is configured to extract the user's facial features from the video data, determine the user's identity and location, and distinguish the identities of different registered users. This mainly benefits the customization of personalized operation of the whole device, such as granting different control rights. The user's location information can be used to assist gesture recognition in determining the operating area of the user's hand, and to determine the user's orientation during voice operations so that the microphone's audio input gain in that direction can be raised. When there are multiple possible users, the module can identify the locations of all faces, judge all user identities, process them separately, and ask which user in the camera's field of view should be given control.
- The gesture recognition module is configured to extract gesture information from the preprocessed video data and determine the hand shape, the hand's motion track, and the hand's coordinates in the image, thereby tracking any hand shape and analyzing the hand's contours in the image; the user activates and controls the whole terminal through specific gestures or actions.
- In this optional embodiment, various existing forms of human-computer interaction technology, including gesture recognition, throat vibration recognition, speech recognition, face recognition, and lip recognition, are integrated. Speech recognition is used as the basic signal, while lip recognition, face recognition, gesture recognition, throat vibration recognition, and the like are used as auxiliary signals to perform score adjustment of the speech recognition candidate words, following the logical sequence in which the basic signal (the speech signal) is analyzed and confirmed first and the auxiliary signals are used for auxiliary judgment. The advantages of each technology in its application field complement one another, and the technical modules remain relatively independent while integrating with each other. The lip information identifies the current user's mouth shape changes and, on that basis, reduces the false recognition rate of voice operations, ensuring that voice operations can be recognized normally in a noisy environment. The face recognition module recognizes the user's location information, which assists gesture recognition in determining the operating area of the user's hand and determines the user's orientation during voice operations so that the microphone's audio input gain in that direction can be raised. The influence of noise is thereby overcome, the speech recognition rate is significantly improved, the result is converted into the related instructions, and both the stability of terminal speech recognition and the comfort of operation are much improved.
- FIG. 7 is a flowchart of a voice recognition processing method according to an embodiment of the present invention. As shown in FIG. 7, the process includes:
- Step S702: perform voice recognition processing on the voice information acquired by the audio sensor as the basic signal;
- Step S704: perform lip recognition, face recognition, vibration recognition, and gesture recognition as auxiliary signals, and adjust the recognition result of the basic signal accordingly.
- Speech recognition objects include speech recognition of isolated vocabulary and continuous large vocabulary speech recognition.
- the former is mainly used to determine control commands, and the latter is mainly used for text input.
- the identification of isolated vocabulary is taken as an example.
- the recognition of continuous large vocabulary uses the same processing method.
- First, the basic signal (the speech signal) is analyzed and confirmed, and the auxiliary signals are then used for auxiliary judgment. Several words with the highest probability scores are selected as candidate words. The candidate word category with the highest probability score produced by auxiliary signal recognition is used as auxiliary information, and the candidate words identified from the basic signal are judged in turn: when a candidate word matches the category identified by the auxiliary signal, the scores of that candidate word and its related word set are raised.
- lip recognition, face recognition, vibration recognition, and gesture recognition are used as auxiliary signals for recognition processing.
- the various recognition methods are independent of each other, and one or more recognition methods can be simultaneously used as auxiliary signal input.
- The modules or units in the apparatus may be code that is stored in a memory of a user terminal and run by its processor, or may be implemented in other manners, which are not enumerated here one by one.
- FIG. 8 is a structural block diagram of a voice recognition processing apparatus according to an embodiment of the present invention. As shown in FIG. 8, the apparatus includes:
- The basic signal module, which includes the audio sensor, is a traditional voice recognition module; the voice recognition module is configured to recognize the pre-processed audio data acquired through the audio sensor;
- The auxiliary signal module includes a front-end camera and a throat vibration sensor, and is configured to acquire the video data, audio data, and motion data used for lip recognition, face recognition, throat vibration recognition, gesture recognition, and the like;
- The signal processing module comprises a lip recognition module, a face recognition module, a vibration recognition module, a gesture recognition module, a voice recognition module, and a score adjustment module, and is configured to recognize the basic signal (the speech signal) and the auxiliary signals, selecting the basic signal as the main voice information and using the auxiliary signals as auxiliary information for score adjustment;
- the lip recognition module is configured to extract the feature of the lip image from the preprocessed video data, and use the lip information to identify the current user's mouth shape change;
- The face recognition module is configured to extract the user's facial features from the video data, determine the user's identity and location, and distinguish the identities of different registered users, which mainly benefits the customization of personalized operation of the whole device, such as granting different control rights;
- The gesture recognition module is configured to extract gesture information from the preprocessed video data and determine the hand shape, the hand's motion track, and the hand's coordinates in the image, thereby tracking any hand shape and analyzing the hand's contours in the image; the user activates and controls the whole terminal through specific gestures or actions.
- FIG. 9 is a flowchart of a speech recognition process according to an embodiment of the present invention. As shown in FIG. 9, the speech recognition method of this embodiment proceeds as follows:
- Step S902: acquire the voice information from the audio sensor, and obtain video data and motion data from the front-end camera and the throat vibration sensor for lip shape recognition, face recognition, throat vibration recognition, gesture recognition, and the like;
- Step S904: taking isolated-vocabulary speech recognition as an example, analyze and confirm the speech signal as the basic signal, and recognize the isolated vocabulary to obtain the most probable words as candidate words;
- Step S906: collect moving images of the human body, covering gesture motion, facial motion, throat vibration, lip shape, and so on, through the terminal device's camera or external sensors as auxiliary signals, and analyze and confirm them to obtain the candidate word categories with the highest probability scores;
- Step S908: judge in turn the candidate words identified from the basic signal, and when a candidate word matches a candidate word category identified by the auxiliary signals, increase that candidate word's score in the speech recognition model lexicon;
- Step S910 after all the basic signals and the auxiliary signals are processed, the candidate words with the highest score are selected as the recognition result.
- Gesture motion, facial motion, throat vibration, lip shape recognition, and the like may be combined, or only one or several of them may be used, as auxiliary signals for recognition, yielding the candidate word categories with the highest probability scores.
- For example, the candidate words recognized from the voice signal, business card holder (0.9) and call (0.9), are judged in turn to determine whether they match the candidate word category recognized by the auxiliary signal. Suppose business card holder matches that category; its probability score is then raised, for example updating the list to business card holder (1.0), call (0.9).
- The candidate word with the highest score, business card holder (1.0), is selected as the recognition result.
- Optionally, the logical sequence for determining the candidate word category may be reversed: the auxiliary signals are recognized first, and the voice signal is then analyzed and confirmed as the basic signal. First, gesture motion, facial motion, throat vibration, lip shape recognition, and the like, or only one or several of them, are used as auxiliary signals for recognition; when several methods are used, their recognition results are accumulated to obtain the candidate word categories with the highest probability scores. On the basis of the speech recognition results, the words with the highest probability scores are then selected as the final recognition result.
- The present solution is described below with a specific example. Suppose that recognizing the owner's voice yields the following results: business card holder (0.9), call (0.9).
- Laryngeal vibration and lip shape recognition are combined as auxiliary signals. Suppose throat vibration identification comes first: the candidates identified from the basic signal, business card holder (0.9) and call (0.9), are judged in turn to determine whether they match the candidate word categories from throat vibration recognition. Assuming business card holder matches, its probability score is raised, for example to business card holder (1.0), call (0.9). On the basis of this result, lip shape recognition judgment continues: business card holder (1.0) and call (0.9) are judged in turn against the candidate word categories from lip recognition. Assuming business card holder again matches, its score is raised further, for example to business card holder (1.1), call (0.9). The recognition results of the two methods are thus processed cumulatively.
- The candidate word with the highest score, business card holder (1.1), is selected as the recognition result.
- Further screening is performed through score adjustment: the score of a candidate word that matches the auxiliary signal recognition may be increased, and the score of a candidate word that does not match may likewise be reduced. After the basic signal and all auxiliary signals have been processed, the candidate with the highest score is selected as the recognition result.
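- Running this scheme on the patent's own numbers reproduces both scenarios: one auxiliary match lifts business card holder from 0.9 to 1.0, and two cumulative matches (throat vibration, then lip shape) lift it to 1.1, so it beats call (0.9). The self-contained sketch below assumes the boost value and the lexicon category tagging; a penalty for non-matching candidates could subtract in the same loop.

```python
candidates = {"business card holder": 0.9, "call": 0.9}
lexicon_categories = {"business card holder": {"contacts"}}  # assumed tagging
BOOST = 0.1   # assumed per-match increment (a penalty could likewise subtract)

# Throat vibration first, then lip shape recognition, both reporting
# the same candidate word category for this utterance.
for auxiliary_category in ["contacts", "contacts"]:
    for word in candidates:
        if auxiliary_category in lexicon_categories.get(word, set()):
            candidates[word] = round(candidates[word] + BOOST, 2)

print(candidates)                            # {'business card holder': 1.1, 'call': 0.9}
print(max(candidates, key=candidates.get))   # business card holder
```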
- Confirming the recognition result with added auxiliary information, in order to improve speech recognition accuracy, is optional for the user; the speech recognizer first determines a recognition result from the input speech. A likelihood metric is then calculated for this recognition result. If the likelihood metric is less than a threshold, the user is prompted to supply auxiliary data, or auxiliary data recognition is turned on automatically; if the likelihood metric is greater than the threshold, the user is prompted to close the auxiliary data, or auxiliary data recognition is turned off automatically.
- The specific value of the threshold is not limited here; it may be set empirically or tuned to the user experience.
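- The gating policy in the two preceding paragraphs can be sketched in a few lines; the default threshold below is an arbitrary assumption, since the text deliberately leaves its value open, and the prompting behavior is simplified to a boolean.

```python
def auxiliary_enabled(likelihood, threshold=0.8):
    """Turn auxiliary-signal recognition on when the speech-only result
    is not trusted, and off when it is; the threshold value is assumed."""
    return likelihood < threshold

# e.g. a speech-only likelihood of 0.6 enables the auxiliary channels,
# while 0.95 leaves them off (or prompts the user to disable them).
```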
- In summary, this optional embodiment integrates various existing forms of human-computer interaction technology, including gesture recognition, throat vibration recognition, speech recognition, face recognition, and lip recognition. Speech recognition is used as the basic signal, while lip recognition, face recognition, gesture recognition, throat vibration recognition, and the like serve as auxiliary signals that perform score adjustment of the speech recognition candidate words. The logical sequence, in which the basic signal (the speech signal) is analyzed and confirmed first and the auxiliary signals are used for auxiliary judgment, improves both the stability of terminal speech recognition and the comfort of operation.
- With the speech recognition processing method and apparatus provided by the present invention, speech recognition is used as the basic signal and, on that basis, is combined with lip recognition, face recognition, gesture recognition, and throat vibration recognition as auxiliary signals.
- This solves the problem of poor user experience caused by the low speech recognition rate in the related art. By drawing on the advantages of each technology in its application field so that their strengths offset one another's weaknesses, the technical modules remain relatively independent while integrating with each other, which greatly improves the speech recognition rate.
- In another embodiment, a storage medium storing the above-mentioned software is further provided, including but not limited to an optical disk, a floppy disk, a hard disk, an erasable memory, and the like.
- The embodiment of the invention further describes a computer storage medium, in which computer executable instructions are stored, the computer executable instructions being configured to execute the voice recognition method shown in FIG. 1.
- the disclosed apparatus and method may be implemented in other manners.
- the device embodiments described above are merely illustrative.
- The division into units is only a division by logical function; in actual implementation there may be other division manners, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
- The units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
- each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated into one unit;
- the unit can be implemented in the form of hardware or in the form of hardware plus software functional units.
- The foregoing program may be stored in a computer readable storage medium and, when executed, performs the steps comprising the above method embodiments.
- The foregoing storage medium includes various media that can store program code, such as a mobile storage device, a random access memory (RAM), a read-only memory (ROM), a magnetic disk, or an optical disk.
- the above-described integrated unit of the present invention may be stored in a computer readable storage medium if it is implemented in the form of a software function module and sold or used as a standalone product.
- Based on this understanding, the technical solution of the embodiments of the present invention, in essence or in the part that contributes to the related art, may be embodied in the form of a software product. The software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the methods described in the embodiments of the present invention.
- the foregoing storage medium includes various media that can store program codes, such as a mobile storage device, a RAM, a ROM, a magnetic disk, or an optical disk.
- The modules or steps of the present invention described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of multiple computing devices. Optionally, they may be implemented by program code executable by the computing device, so that they may be stored in a storage device and executed by the computing device, and in some cases the steps shown or described may be performed in an order different from the one given here; alternatively, they may be fabricated separately as individual integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module.
- the invention is not limited to any specific combination of hardware and software.
- The method, apparatus, and computer storage medium for voice recognition provided by the embodiments of the present invention have the following beneficial effect: they effectively remedy the low accuracy that results from recognizing the user's speech content from the user's voice alone, thereby improving the accuracy of speech recognition.
Abstract
A method and apparatus for speech recognition, and a computer storage medium. The method includes: acquiring voice recognition information of a user's current voice, and acquiring auxiliary identification information for the voice recognition information based on the user's current state corresponding to the current voice (S102); and determining a final recognition result of the user's current voice according to the voice recognition information and the auxiliary identification information (S104). The method solves the problem in the related art that obtaining the user's speech content from the user's voice alone leads to low speech recognition accuracy, thereby improving the accuracy of speech recognition.
Description
本发明涉及通信领域,具体而言,涉及一种语音识别的方法、装置及计算机存储介质。
语音识别技术随着计算机和相关软硬件技术的发展,已越来越多的应用在各个领域,其识别率也在不断的提高。在环境安静、发音标准等特定条件下,目前应用在语音识别输入文字系统的识别率已经达到95%以上。常规语音识别技术已比较成熟,针对移动终端的语音识别,由于语音质量相对于普通语音识别场景相对较差,因此语音识别效果受到限制。这里语音质量很差包括如下的原因,例如客户端有背景噪声、客户端语音采集设备、通话设备的噪声、通信线路的噪声和干扰、还有本身说话带有口音或者使用了方言、说话人本身的说话含糊或者不清楚等。所有这些因素都可能造成语音识别效果变差。其识别率受到很多因素的影响,针对相关技术中语音识别率低而导致的用户体验度差的问题,目前尚未提出有效的解决方案。在车上或噪声较大、发音不标准的情况下,其识别率将大打折扣,以至于无法达到真正实用目的。其正确识别率低,影响精确操控,效果不够理想。若能采用其它方法来辅助判断以提高其语音识别的准确率,那么语音识别的实用性将显著提高。
人类的语言认知过程是一个多通道的感知过程。在人与人日常交流的过程中,通过声音来感知他人讲话的内容,在喧闹的环境或对方发音模糊不清时,还需要眼睛观察其口型,表情等的变化,才能准确地理解对方所讲的内容。现行的语音识别系统忽略了语言感知的视觉特性这一面,仅仅利用了单一的听觉特性,使得现有的语音识别系统在噪声环境或多话者条件下,其识别率都显著下降,降低了语音识别的实用性,应用范围也受限制。
针对相关技术中,仅通过用户的声音获取用户的讲话内容导致语音识别的准确度不高的问题,还未提出有效的解决方案。
发明内容
本发明实施例提供了一种语音识别的方法、装置及计算机存储介质,以至少解决相关技术中仅通过用户的声音获取用户的讲话内容导致语音识别的准确度不高的问题。
根据本发明实施例的一个方面,提供了一种语音识别的方法,包括:获取用户当前语音的语音识别信息,以及基于与所述用户当前语音对应的用户当前状态获取所述语音识别信息的辅助识别信息;根据所述语音识别信息和所述辅助识别信息确定所述用户当前语音的最终识别结果。
进一步地,根据所述语音识别信息和所述辅助识别信息确定所述用户当前语音的最终识别结果包括:根据所述语音识别信息获取所述用户当前语音对应的一个或者多个第一候选词汇;根据所述辅助识别信息获取所述用户当前语音对应的词汇类别或者一个或者多个第二候选词汇;根据所述一个或者多个第一候选词汇和所述词汇类型确定所述用户当前语音的最终识别结果;或者,根据所述一个或者多个第一候选词汇和所述一个或者多个第二候选词汇确定所述用户当前语音的最终识别结果。
进一步地,根据所述一个或者多个第一候选词汇和所述词汇类型确定所述用户当前语音的最终识别结果包括:从所述一个或者多个第一候选词汇中选择符合所述词汇类别的第一特定词汇,将所述第一特定词汇作为所述用户当前语音的最终识别结果。
进一步地,根据所述一个或者多个第一候选词汇和所述一个或者多个第二候选词汇确定所述用户当前语音的最终识别结果包括:从所述一个或者多个第二候选词汇中选择与所述一个或者多个第一候选词汇相似度高的第二特定词汇,将所述第二特定词汇作为所述用户当前语音的最终识别结果。
进一步地,基于与所述用户当前语音对应的用户当前状态获取所述语音识别信息的辅助识别信息包括:获取用于指示所述用户当前状态的图像;根据所述图像获取图像特征信息;根据所述图像特征信息获取与所述图像特征信息对应的词汇类别和/或一个或者多个候选词汇,将所述词汇类别和/或所述一个或者多个候选词汇作为所述辅助识别信息。
进一步地,根据所述图像特征信息获取与所述图像特征信息对应的词汇类别和/或一个或者多个候选词汇包括:在预定的图像库中查找与所述图像特征信息相似度最高的特定图像;根据预设的图像与词汇类别或者一个或者多个候选词汇的对应关系,获取与所述特定图像对应的词汇类别或者一个或者多个候选词汇。
进一步地,所述用户当前状态包括以下至少之一:所述用户的唇形运动状态、所述用户的喉部振动状态、所述用户的脸部运动状态、所述用户的手势运动状态。
进一步地,获取用户当前语音的语音识别信息,以及基于与所述用户当前语音对应的用户当前状态获取所述语音识别信息的辅助识别信息之前包括:判定基于所述语音识别信息确定所述用户当前语音的最终识别结果的正确率小于预定阈值。
根据本发明实施例的另一个方面,提供了一种语音识别的装置,所述装置包括:获取模块,设置为获取用户当前语音的语音识别信息,以及基于与所述用户当前语音对应的用户当前状态获取所述语音识别信息的辅助识别信息;确定模块,设置为根据所述语音识别信息和所述辅助识别信息确定所述用户当前语音的最终识别结果。
进一步地,所述确定模块包括:第一获取单元,设置为根据所述语音识别信息获取所述用户当前语音对应的一个或者多个第一候选词汇;第二获取单元,设置为根据所述辅助识别信息获取所述用户当前语音对应的词汇类别或者一个或者多个第二候选词汇;确定单元,设置为根据所述一个或者多个第一候选词汇和所述词汇类型确定所述用户当前语音的最终识别结果;或者,根据所述一个或者多个第一候选词汇和所述一个或者多个第二候选词汇确定所述用户当前语音的最终识别结果。
进一步地,所述确定单元还设置为从所述一个或者多个第一候选词汇中选择符合所述词汇类别的第一特定词汇,将所述第一特定词汇作为所述用户当前语音的最终识别结果。
进一步地,所述确定单元还设置为从所述一个或者多个第二候选词汇中选择与所述一个或者多个第一候选词汇相似度高的第二特定词汇,将所述第二特定词汇作为所述用户当前语音的最终识别结果。
进一步地,所述获取模块还包括:第三获取单元,设置为获取用于指示所述用户当前状态的图像;第四获取单元,设置为根据所述图像获取图像特征信息;第五获取单元,设置为根据所述图像特征信息获取与所述图像特征信息对应的词汇类别和/或一个或者多个候选词汇,将所述词汇类别和/或所述一个或者多个候选词汇作为所述辅助识别信息。
进一步地,所述第五获取单元还包括:查找子单元,设置为在预定的图像库中查找与所述图像特征信息相似度最高的特定图像;获取子单元,设置为根据预设的图像与词汇类别或者一个或者多个候选词汇的对应关系,获取与所述特定图像对应的词汇类别或者一个或者多个候选词汇。
进一步地,所述用户当前状态包括以下至少之一:所述用户的唇形运动状态、所述用户的喉部振动状态、所述用户的脸部运动状态、所述用户的手势运动状态。
进一步地,所述装置还包括:判定模块,设置为判定基于所述语音识别信息确定所述用户当前语音的最终识别结果的正确率小于预定阈值。
根据本发明实施例的另一个方面,还提供了一种终端,包括处理器,所述处理器设置为获取用户当前语音的语音识别信息,以及基于与所述用户当前语音对应的用户当前状态获取所述语音识别信息的辅助识别信息;根据所述语音识别信息和所述辅助识别信息确定所述用户当前语音的最终识别结果。
根据本发明实施例的再一个方面,还提供了一种计算机存储介质,所述计算机存储介质中存储有计算机可执行指令,所述计算机可执行指令用于上述的语音识别的方法。
通过本发明实施例,获取用户当前语音的语音识别信息,以及基于与用户当前语音对应的用户当前状态获取该语音识别信息的辅助识别信息;根据语音识别信息和辅助识别信息确定用户当前语音的最终识别结果。由此解决了相关技术中仅通过用户的声音获取用户的讲话内容导致语音识别的准确度不高的问题,进而提高了语音识别的准确性。
The drawings described here are provided for a further understanding of the present invention and constitute a part of this application; the illustrative embodiments of the present invention and their descriptions serve to explain the present invention and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present invention;
Fig. 2 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention;
Fig. 3 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention (I);
Fig. 4 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention (II);
Fig. 5 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention (III);
Fig. 6 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention (IV);
Fig. 7 is a flowchart of a speech recognition processing method according to an embodiment of the present invention;
Fig. 8 is a block diagram of a speech recognition processing apparatus according to an embodiment of the present invention;
Fig. 9 is a flowchart of speech recognition processing according to an embodiment of the present invention.
The present invention is described in detail below with reference to the drawings and in conjunction with embodiments. It should be noted that, where no conflict arises, the embodiments of this application and the features in those embodiments may be combined with one another.
This embodiment provides a speech recognition method. Fig. 1 is a flowchart of the speech recognition method according to an embodiment of the present invention; as shown in Fig. 1, the flow comprises the following steps:
Step S102: acquire speech recognition information of the user's current speech, and acquire auxiliary recognition information for that speech recognition information based on the user's current state corresponding to the current speech;
Step S104: determine the final recognition result of the user's current speech according to the speech recognition information and the auxiliary recognition information.
Through the above steps, the speech recognition information of the user's current speech is acquired, state feature information of the user at the moment of speaking is acquired as well, and that state feature information serves as auxiliary information for recognizing the current speech. Compared with the prior art, in which recognition from the current speech alone has low accuracy, these steps solve the problem in the related art that obtaining the content of a user's speech from the user's voice alone leads to low recognition accuracy, thereby improving the accuracy of speech recognition.
Step S104 above involves determining the final recognition result of the user's current speech from the speech recognition information and the auxiliary recognition information. In an optional embodiment, one or more first candidate words corresponding to the user's current speech are acquired from the speech recognition information; a vocabulary category or one or more second candidate words corresponding to the current speech are acquired from the auxiliary recognition information; the final recognition result is then determined from the one or more first candidate words and the vocabulary category, or from the one or more first candidate words and the one or more second candidate words.
There are many ways to determine the final recognition result from the one or more first candidate words and the vocabulary category. In an optional embodiment, a first specific word matching the vocabulary category is selected from the one or more first candidate words and taken as the final recognition result of the user's current speech.
In another optional embodiment, a second specific word with high similarity to the one or more first candidate words is selected from the one or more second candidate words and taken as the final recognition result of the user's current speech.
As for acquiring the auxiliary recognition information used in the above determination, in an optional embodiment an image indicating the user's current state is first acquired, image feature information is then extracted from the image, and a vocabulary category and/or one or more candidate words corresponding to the image feature information are acquired from it; the vocabulary category and/or the one or more candidate words serve as the auxiliary recognition information.
In an optional embodiment, the specific image with the highest similarity to the image feature information is searched for in a predetermined image library, and the vocabulary category or one or more candidate words corresponding to that specific image are acquired according to a preset correspondence between images and vocabulary categories or candidate words. The vocabulary category and/or candidate words corresponding to the image feature information are thereby obtained.
The user's current state may take many forms, illustrated by example below. In an optional embodiment it includes the user's lip movement state, the user's throat vibration state, the user's facial movement state, and the user's gesture movement state. The information covered by these state features is given by way of example only and is not limiting. In real life, for instance, the content of a speaker's words can be recognized through lip reading alone; lip shape is therefore an important auxiliary factor for recognizing speech.
In an optional embodiment, before the speech recognition information of the user's current speech and the auxiliary recognition information are acquired, it is determined that the accuracy of a final recognition result based on the speech recognition information alone is below a predetermined threshold.
This embodiment also provides a speech recognition apparatus, configured to implement the above embodiments and preferred implementations; what has already been explained is not repeated. As used below, the term "module" may be a combination of software and/or hardware implementing a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 2 is a block diagram of the speech recognition apparatus according to an embodiment of the present invention. As shown in Fig. 2, the apparatus comprises: an acquisition module 22 configured to acquire speech recognition information of the user's current speech and to acquire auxiliary recognition information for that speech recognition information based on the user's current state corresponding to the current speech; and a determination module 24 configured to determine the final recognition result of the user's current speech according to the speech recognition information and the auxiliary recognition information.
Fig. 3 is a block diagram of the speech recognition apparatus according to an embodiment of the present invention (I). As shown in Fig. 3, the determination module 24 comprises: a first acquisition unit 242 configured to acquire, according to the speech recognition information, one or more first candidate words corresponding to the user's current speech; a second acquisition unit 244 configured to acquire, according to the auxiliary recognition information, a vocabulary category or one or more second candidate words corresponding to the user's current speech; and a determination unit 246 configured to determine the final recognition result according to the one or more first candidate words and the vocabulary category, or according to the one or more first candidate words and the one or more second candidate words.
Optionally, the determination unit 246 is also configured to select, from the one or more first candidate words, a first specific word matching the vocabulary category, and to take the first specific word as the final recognition result of the user's current speech.
Optionally, the determination unit 246 is also configured to select, from the one or more second candidate words, a second specific word with high similarity to the one or more first candidate words, and to take the second specific word as the final recognition result of the user's current speech.
Fig. 4 is a block diagram of the speech recognition apparatus according to an embodiment of the present invention (II). As shown in Fig. 4, the acquisition module 22 also comprises: a third acquisition unit 222 configured to acquire an image indicating the user's current state; a fourth acquisition unit 224 configured to acquire image feature information from the image; and a fifth acquisition unit 226 configured to acquire, according to the image feature information, a vocabulary category and/or one or more candidate words corresponding to the image feature information, and to take them as the auxiliary recognition information.
Fig. 5 is a block diagram of the speech recognition apparatus according to an embodiment of the present invention (III). As shown in Fig. 5, the fifth acquisition unit 226 also comprises: a search sub-unit 2262 configured to search a predetermined image library for the specific image with the highest similarity to the image feature information; and an acquisition sub-unit 2264 configured to acquire, according to a preset correspondence between images and vocabulary categories or candidate words, the vocabulary category or one or more candidate words corresponding to that specific image.
Optionally, the user's current state comprises at least one of: the user's lip movement state, the user's throat vibration state, the user's facial movement state, and the user's gesture movement state.
Fig. 6 is a block diagram of the speech recognition apparatus according to an embodiment of the present invention (IV). As shown in Fig. 6, the apparatus also comprises a judgment module 26 configured to determine that the accuracy of a final recognition result determined from the speech recognition information alone is below a predetermined threshold.
According to another aspect of the embodiments of the present invention, a terminal is also provided, comprising a processor configured to acquire speech recognition information of the user's current speech, to acquire auxiliary recognition information for that speech recognition information based on the user's current state corresponding to the current speech, and to determine the final recognition result of the user's current speech according to the speech recognition information and the auxiliary recognition information.
It should be noted that the above modules can be implemented in software or in hardware. For the latter, this can be achieved in, but is not limited to, the following way: the above modules are all located in the same processor, or the modules are located respectively in a first processor, a second processor, a third processor, and so on.
In view of the above problems in the related art, specific optional embodiments are described below; they combine the above optional embodiments and their optional implementations.
This optional embodiment provides a speech recognition processing method and apparatus to solve the problem of poor user experience caused by low speech recognition rates in the related art. To overcome the above shortcomings and deficiencies of the prior art, the aim of this optional embodiment is to provide an intelligent speech recognition method and apparatus based on auxiliary interaction: on the basis of speech recognition, which supplies the basic signal, lip-shape recognition, face recognition, gesture recognition, throat vibration recognition and the like are used with it as auxiliary signals. Each technique contributes the strengths of its own field and compensates for the weaknesses of the others; the modules are relatively independent yet fused with one another, greatly raising the recognition rate of speech processing. Preferably, whether auxiliary-signal recognition is added can be decided by the speech recognition result: when the likelihood of the speech recognition result falls below a threshold, auxiliary data are added. This matches the fact that human language cognition is a multi-channel perception process: the terminal perceives the content of speech through sound and, in coordination, recognizes the speaker's mouth shape, facial changes and so on in order to understand accurately what is said.
According to one aspect of this optional embodiment, a speech recognition processing method is provided. On the basis of speech recognition performed on audio data acquired by an audio sensor as the basic signal, motion images of the human body — including gesture movement, facial movement, throat vibration and lip shape — are captured by the terminal's camera or external sensors and parsed by an integrated image algorithm and motion processing chip to serve as auxiliary signals for speech recognition. The recognition results of the basic and auxiliary signals are jointly processed by the terminal, which then executes the corresponding operation. The auxiliary-signal recognition results are accumulated with the basic-signal results into a unified recognition result, assisting speech recognition and raising the recognition rate.
Gesture movement, facial movement, throat vibration and lip-shape recognition are combined, each modality organically integrated through feature extraction, template training, template classification and decision. A logical judgment sequence is applied in which the speech signal is first analyzed and confirmed as the basic signal and the auxiliary signals are then used for auxiliary judgment, effectively reducing the probability of recognition errors caused by noise and external sound interference. During auxiliary-signal recognition, feature data are collected by the sensors and camera, features are extracted and matched against a preset template library through a series of judgments, and the results are compared with the corresponding recognition features to identify possible candidate words in the speech recognition model's vocabulary.
Optionally, the lip-shape recognition captures images of the speaker's lips with a camera, processes the lip images, extracts lip-shape features dynamically in real time, and then determines the spoken content with a lip-shape pattern recognition algorithm. A judgment method combining lip shape and lip color locates the mouth accurately, and a suitable lip-shape matching algorithm performs the recognition.
Optionally, the lip-shape recognition extracts lip-image features from the preprocessed video data and uses them to recognize changes in the current user's mouth shape; detecting the user's mouth movement realizes lip recognition and improves recognition efficiency and accuracy. The mouth-movement feature maps are classified to obtain classification information, the feature maps being grouped so that each feature type of mouth-movement feature map corresponds to several vocabulary categories. The information captured for lip recognition is put through a series of processing steps such as denoising and analog-to-digital (A/D) conversion and is then compared against the template library preset in the image/speech recognition processing module: the similarity between the lip-recognition information and all pre-sampled mouth-movement feature maps is compared, and the vocabulary categories corresponding to the most similar feature map are read out.
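A minimal sketch of this template matching, assuming feature maps are represented as vectors and the template library pairs each feature map with its vocabulary categories (the library contents below are placeholders):

```python
import numpy as np

# Illustrative template library: each entry is a pre-sampled mouth-movement
# feature map (a vector here) with the vocabulary categories it maps to.
TEMPLATE_LIBRARY = [
    (np.array([0.9, 0.1, 0.3]), ["contacts", "call"]),
    (np.array([0.2, 0.8, 0.5]), ["browser", "settings"]),
]

def lip_categories(lip_feature_vec):
    """Return the vocabulary categories of the most similar mouth-movement template."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    best_map, best_cats = max(TEMPLATE_LIBRARY,
                              key=lambda entry: cosine(lip_feature_vec, entry[0]))
    return best_cats
```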
Optionally, the throat vibration recognition captures the speaker's throat vibration pattern with an external sensor, processes the vibration pattern, extracts vibration features dynamically in real time, and then determines the spoken content with a vibration-pattern recognition algorithm.
Optionally, before throat vibration recognition is performed for a user, the user's throat-vibration movement feature maps must first be sampled, and a separate throat-vibration profile is built for each user. When pre-sampling a user's throat-vibration feature maps, the feature map may be sampled for a single syllable uttered by the user, or for a whole word. Throat vibration differs between differently pronounced speech events, and successive speech events uttered by a user are correlated; therefore, after throat vibration has been recognized, contextual error-correction is used to verify the recognized vibrations, reducing recognition errors among feature maps of the same class and further improving the accuracy of throat vibration recognition.
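The contextual error-correction can be illustrated with a simple plausibility filter; the bigram table and the minimum probability are assumptions for illustration, since the patent does not specify the technique's internals:

```python
# Illustrative table of how likely one recognized event is to follow another.
BIGRAM_PROB = {("open", "contacts"): 0.30, ("open", "browser"): 0.25}

def verify_with_context(prev_event, candidates, min_prob=0.05):
    """Keep only candidates that plausibly follow the previously recognized event."""
    return [c for c in candidates
            if BIGRAM_PROB.get((prev_event, c), 0.0) >= min_prob]
```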
Optionally, the throat vibration recognition extracts throat-vibration image features from the preprocessed vibration data and uses them to recognize changes in the current user's throat vibration; detecting the throat vibration movement realizes the recognition and improves recognition efficiency and accuracy. The throat-vibration movement feature maps are classified to obtain classification information, the feature maps being grouped so that each feature type corresponds to several vocabulary categories. The captured throat-vibration information is compared against the template library preset in the image/speech recognition processing module: the similarity with all pre-sampled throat-vibration feature maps is compared, and the vocabulary categories corresponding to the most similar feature map are read out.
The face recognition extracts the user's facial features from the video data and determines the user's identity and position. Facial muscles also follow distinct movement patterns during speech; by capturing facial-muscle movements, the corresponding muscle movement patterns can be identified from the signal features and used to assist the recognition of the speech information.
According to one aspect of this optional embodiment, a speech recognition processing apparatus is also provided, comprising a basic-signal module, an auxiliary-signal module and a signal-processing module.
The basic-signal module is a conventional speech recognition module configured to recognize preprocessed audio data via an audio sensor. Its recognition targets include isolated-word speech recognition and large-vocabulary continuous speech recognition: the former is mainly used to determine control commands, the latter mainly for text input. The present invention takes isolated-word recognition as its main example; large-vocabulary continuous recognition is handled in the same way.
Optionally, the audio sensor is a microphone array or a directional microphone. Noise interference of various kinds exists in the environment, and existing audio capture based on an ordinary microphone is equally sensitive to the user's speech and to ambient noise, with no ability to distinguish between them, so the correct rate of voice-command operation easily drops. A microphone array or directional microphone overcomes this: sound source localization and speech enhancement algorithms track the operating user's voice and enhance its signal, suppress ambient noise and interfering voices, raise the signal-to-noise ratio of the system's audio input, and ensure reliable data quality for the back-end algorithms.
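One common way to realize such enhancement is delay-and-sum beamforming. The patent does not name a specific algorithm, so the following is only an illustrative sketch, assuming the per-microphone delays toward the tracked speaker are already known from sound source localization:

```python
import numpy as np

def delay_and_sum(channels, delays_samples):
    """channels: one 1-D sample array per microphone.
    delays_samples: integer delay aligning each channel to the speaker direction."""
    length = min(len(ch) - d for ch, d in zip(channels, delays_samples))
    aligned = [ch[d:d + length] for ch, d in zip(channels, delays_samples)]
    # Coherent speech adds up across channels while uncorrelated noise
    # averages out, which raises the signal-to-noise ratio.
    return np.mean(aligned, axis=0)
```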
The auxiliary-signal module comprises a front camera, an audio sensor and a throat vibration sensor, and is configured to acquire video data, audio data and motion data.
Optionally, the throat vibration sensor is integrated in a wearable device, positioned in contact with the user's throat, and detects the speech vibrations the user produces. One temperature sensor is placed on the inside of the wearable device and another on the outside; by comparing the temperatures detected by the two sensors, the microprocessor judges whether the wearable device is being worn, and when it is not worn the device automatically enters sleep mode, reducing the device's overall power consumption. The microprocessor detects the vibration sensor's state, judges and recognizes the voice command issued by the user, and sends the command via a Bluetooth device to the equipment to be controlled, which executes the voice recognition command.
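The worn-state detection can be sketched as follows; the temperature-difference threshold and the device interface are illustrative assumptions:

```python
WEAR_TEMP_DELTA = 2.0  # degrees Celsius; illustrative threshold

def is_worn(inner_temp, outer_temp, delta=WEAR_TEMP_DELTA):
    """Skin contact keeps the inner sensor warmer than the outer one."""
    return (inner_temp - outer_temp) >= delta

def tick(inner_temp, outer_temp, device):
    # `device` is a hypothetical object exposing wake()/sleep().
    if is_worn(inner_temp, outer_temp):
        device.wake()   # normal vibration-sensing operation
    else:
        device.sleep()  # enter sleep mode to cut power consumption
```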
The signal-processing unit comprises a lip-recognition module, a face-recognition module, a vibration-recognition module, a gesture-recognition module, a speech-recognition module and a score-adjustment module; it is configured to recognize the basic signal (the speech signal) and the auxiliary signals, selecting the basic signal as the primary speech information and the auxiliary signals as auxiliary speech information.
A logical judgment sequence is applied in which the basic signal (the speech signal) is analyzed and confirmed first and the auxiliary signals then provide auxiliary judgment. In the concrete recognition process, the several words with the highest likelihood scores from speech-signal recognition are selected as candidate words, and for each candidate word a multi-level related-word set is generated from a predetermined word table. The auxiliary speech information produced by the auxiliary signals is used to raise the scores, in the speech recognition model's vocabulary, of the candidate words and of the related words in their related-word sets. When the basic and auxiliary signals have all been processed, the candidate word or related word with the highest score is selected as the recognition result.
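A minimal Python sketch of this score adjustment, assuming candidates arrive as a word-to-score dictionary; the related-word table, the boost size and the category_of function are illustrative assumptions:

```python
RELATED_WORDS = {"contacts": ["address book", "phone book"]}  # illustrative table
BOOST = 0.1  # illustrative boost size

def adjust_scores(candidates, aux_category, category_of):
    """candidates: dict word -> likelihood score from the basic (speech) signal."""
    scores = dict(candidates)
    for word in list(scores):
        if category_of(word) == aux_category:
            scores[word] += BOOST  # boost the matching candidate itself
            for related in RELATED_WORDS.get(word, []):
                # related words enter the pool and are boosted as well
                scores[related] = scores.get(related, 0.0) + BOOST
    # highest-scoring candidate or related word wins
    return max(scores, key=scores.get)
```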
The lip-recognition module is configured to extract lip-image features from the preprocessed video data and to use the lip information to recognize changes in the current user's mouth shape.
The face-recognition module is configured to extract the user's facial features from the video data and to determine the user's identity and position. Identifying different registered users chiefly facilitates personalized operation of the whole apparatus, such as granting different levels of control; the user's position information can assist gesture recognition in determining the operating region of the user's hand, and can determine the user's bearing during voice operation so as to raise the microphone's audio input gain in the user's direction. When several possible users are present, this module can identify the positions of all faces, judge the identity of every user and handle each separately, and ask which of the users in the camera's field of view is to be granted control.
The gesture-recognition module is configured to extract gesture information from the preprocessed video data, to determine the hand shape, the hand's motion trajectory and the hand's coordinates in the image, and thereby to track an arbitrary hand shape and analyze the hand's contour in the image; through specific gestures or actions the user obtains activation and control of the whole terminal.
Through this optional embodiment, the existing forms of human-computer interaction technology — gesture recognition, throat vibration recognition, speech recognition, face recognition, lip recognition and so on — are fused. Speech recognition provides the basic signal, and lip recognition, face recognition, gesture recognition, throat vibration recognition and the like are used with it as auxiliary signals for score adjustment of the speech recognition candidates. A logical judgment sequence is applied in which the basic signal (the speech signal) is analyzed and confirmed first and the auxiliary signals then provide auxiliary judgment; each technique contributes the strengths of its field, the modules being relatively independent yet fused. Lip information is used to recognize changes in the current user's mouth shape, and on that basis the misjudgment rate during voice operation is lowered, so that voice operation is recognized correctly even in noisy environments. The face-recognition module identifies the user's position, which can assist gesture recognition in determining the hand's operating region and can determine the user's bearing during voice operation so as to raise the microphone's input gain in that direction. The influence of noise is thus overcome, the speech recognition rate rises markedly, and the result is then converted into the corresponding command, improving both the stability of terminal speech recognition and the comfort of operation.
The steps shown in the flowcharts of the drawings may be executed in a user terminal such as a smartphone or tablet, and although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from the one given here.
This embodiment provides a speech recognition processing method. Fig. 7 is a flowchart of the speech recognition processing method according to an embodiment of the present invention; as shown in Fig. 7, the flow comprises:
Step S702: recognize the speech information acquired by the audio sensor as the basic signal;
Step S704: recognize lip shape, face, vibration and gesture as auxiliary signals, and adjust the scores of the basic signal's recognition results.
The objects of speech recognition include isolated-word recognition and large-vocabulary continuous recognition; the former is mainly used to determine control commands, the latter mainly for text input. This embodiment takes isolated-word recognition as its example; large-vocabulary continuous recognition is handled in the same way. Through the above steps, a logical judgment sequence is applied in which the basic signal (the speech signal) is analyzed and confirmed first and the auxiliary signals then provide auxiliary judgment. The several words with the highest likelihood scores from speech-signal recognition are selected as candidate words, and for each candidate word a multi-level related-word set is generated from a predetermined word table. The candidate-word category with the highest likelihood score produced by auxiliary-signal recognition serves as auxiliary information; the candidate words recognized from the basic signal are examined in turn, and any candidate matching the category recognized from the auxiliary signals has its score, and the scores of the related words in its related-word set, raised in the speech recognition model's vocabulary. When the basic and auxiliary signals have all been processed, the candidate word or related word with the highest score is selected as the recognition result.
In a specific implementation, lip recognition, face recognition, vibration recognition and gesture recognition serve as auxiliary signals; these recognition modalities are mutually independent, and one or more of them can be used simultaneously as auxiliary-signal input.
This embodiment also provides an apparatus corresponding to the method of the above embodiment; what has already been explained is not repeated here. The modules or units in the apparatus may be code stored in a memory or user terminal and executable by a processor, or may be implemented in other ways, which are not enumerated one by one here.
According to one aspect of the embodiments of the present invention, a speech recognition processing apparatus is also provided. Fig. 8 is a block diagram of the speech recognition processing apparatus according to an embodiment of the present invention; as shown in Fig. 8, the apparatus comprises:
a basic-signal module, comprising an audio sensor, which is a conventional speech recognition module configured to recognize preprocessed audio data via the audio sensor;
an auxiliary-signal module, comprising a front camera and a throat vibration sensor, configured to acquire video data, audio data and motion data, covering lip recognition, face recognition, throat vibration recognition, gesture recognition and the like;
a signal-processing module, comprising a lip-recognition module, a face-recognition module, a vibration-recognition module, a gesture-recognition module, a speech-recognition module and a score-adjustment module, configured to recognize the basic signal (the speech signal) and the auxiliary signals, selecting the basic signal as the primary speech information and using the auxiliary signals as auxiliary information for score adjustment.
The lip-recognition module is configured to extract lip-image features from the preprocessed video data and to use the lip information to recognize changes in the current user's mouth shape.
The face-recognition module is configured to extract the user's facial features from the video data and to determine the user's identity and position; identifying different registered users chiefly facilitates personalized operation of the whole apparatus, such as granting different levels of control.
The gesture-recognition module is configured to extract gesture information from the preprocessed video data, to determine the hand shape, the hand's motion trajectory and the hand's coordinates in the image, and thereby to track an arbitrary hand shape and analyze the hand's contour in the image; through specific gestures or actions the user obtains activation and control of the whole terminal.
Fig. 9 is a flowchart of the speech recognition processing method according to the present invention. As shown in Fig. 9, the speech recognition method of this embodiment proceeds as follows:
Step S902: acquire speech information from the audio sensor, and acquire video data and motion data — information for lip recognition, face recognition, throat vibration recognition, gesture recognition and the like — from the front camera and the throat vibration sensor;
Step S904: taking isolated-word speech recognition as the example, recognize and confirm the speech signal as the basic signal, and obtain the several most likely words for the isolated word as candidate words;
Step S906: analyze and confirm, as auxiliary signals, the human-body motion images captured by the terminal camera or external sensors — including gesture movement, facial movement, throat vibration and lip shape — and obtain the candidate-word category with the highest likelihood score;
Step S908: examine in turn the candidate words recognized from the basic signal, and raise the score, in the speech recognition model's vocabulary, of any candidate that matches the category recognized from the auxiliary signals;
Step S910: when the basic and auxiliary signals have all been processed, select the candidate word with the highest score as the recognition result.
This optional embodiment is illustrated below with a concrete example. Suppose recognition of the owner's speech yields the following result:
"please (0.6) contacts (0.9) call (0.9) browser (0.7)", where the bracketed numbers are likelihood scores representing how likely each word is: the larger the score, the higher the likelihood. The words with the highest likelihood scores are selected as candidate words; for example, contacts (0.9) and call (0.9) are selected as the speech recognition result.
Meanwhile, gesture movement, facial movement, throat vibration, lip recognition and the like — combined, or using only one or more of these modalities — are recognized as auxiliary signals, yielding the candidate-word category with the highest likelihood score.
The speech-recognized contacts (0.9) and call (0.9) are examined in turn to see whether they match the candidate-word category recognized from the auxiliary signals. Suppose contacts matches the category: its likelihood score is then raised, for example updated to contacts (1.0), call (0.9).
When the basic speech signal and the auxiliary signals have all been processed, the candidate word with the highest score, contacts (1.0), is selected as the recognition result.
As an optional variant of this embodiment, the reverse logical judgment sequence can be applied: the auxiliary signals first determine the candidate-word category, and the speech signal, as the basic signal, is then analyzed and confirmed. First, gesture movement, facial movement, throat vibration, lip recognition and so on — combined, or using only one or more of them — are recognized as auxiliary signals; when several modalities are used, their recognition results are accumulated, yielding the candidate-word category with the highest likelihood score. On this basis the speech recognition result is combined, and the word with the highest likelihood score is selected as the final recognition result. This variant is illustrated below with a concrete example. Suppose recognition of the owner's speech yields the following result:
"please (0.6) contacts (0.9) call (0.9) browser (0.7)", where the bracketed numbers are likelihood scores. The words with the highest likelihood scores are selected as candidate words; for example, contacts (0.9) and call (0.9) are selected as the speech recognition result.
Throat vibration and lip recognition are used together as auxiliary signals. Suppose throat vibration recognition comes first: contacts (0.9) and call (0.9) from the basic signal are examined in turn against the candidate-word category recognized from throat vibration. Suppose contacts matches that category: its likelihood score is raised, for example updated to contacts (1.0), call (0.9). Lip recognition then continues from the previous result, examining contacts (1.0) and call (0.9) in turn against the lip-recognition candidate-word category. Suppose contacts matches that category too: its score is raised again, for example to contacts (1.1), call (0.9). The results of the two modalities have thus been accumulated.
When the basic speech signal and the auxiliary signals have all been processed, the candidate word with the highest score, contacts (1.1), is selected as the recognition result.
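The two worked examples can be reproduced with a short sketch; the +0.1 boost follows the numbers in the examples, and the category checks are stubbed out as assumptions:

```python
candidates = {"contacts": 0.9, "call": 0.9}  # speech-signal candidates

def apply_auxiliary(scores, matches_category, boost=0.1):
    """Raise the score of every candidate that matches the auxiliary category."""
    return {w: s + boost if matches_category(w) else s for w, s in scores.items()}

# Throat vibration recognition: assume only "contacts" matches -> contacts 1.0
candidates = apply_auxiliary(candidates, lambda w: w == "contacts")
# Lip recognition agrees: the two auxiliary results accumulate -> contacts 1.1
candidates = apply_auxiliary(candidates, lambda w: w == "contacts")

print(max(candidates, key=candidates.get))  # -> contacts (score 1.1)
```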
As an optional variant of this embodiment, the further screening is done through score adjustment: the scores of candidates that match the auxiliary-signal recognition can be raised, and equally the scores of candidates that do not match can be lowered. When the basic and auxiliary signals have all been processed, the candidate with the highest score is selected as the recognition result.
As an optional variant of this embodiment, the use of auxiliary information to confirm the recognition result, added to improve recognition accuracy, is optional for the user. The speech recognizer determines a recognition result from the input speech and computes a likelihood measure for that result. If the likelihood measure is below a threshold, the user is prompted whether to enter auxiliary data, or auxiliary-data recognition is switched on automatically. If the likelihood measure exceeds the threshold, the user is prompted whether to switch auxiliary data off, or it is switched off automatically. The concrete threshold value is not limited here; it is derived from experience or from user-experience testing.
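A sketch of this optional switching logic, with an illustrative threshold value, since the patent leaves the concrete figure to experience or user testing:

```python
THRESHOLD = 0.8  # illustrative; derived from experience or user testing in practice

def auxiliary_enabled(likelihood, auto=True, ask_user=None):
    """Open auxiliary-data recognition below the threshold, close it above.
    ask_user is a hypothetical prompt callback returning True/False."""
    if likelihood < THRESHOLD:
        return True if auto else ask_user("Enable auxiliary input?")
    return False if auto else ask_user("Keep auxiliary input on?")
```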
Based on the speech recognition method provided by the above embodiment, the existing forms of human-computer interaction technology — gesture recognition, throat vibration recognition, speech recognition, face recognition, lip recognition and so on — are fused: speech recognition provides the basic signal, and lip recognition, face recognition, gesture recognition, throat vibration recognition and the like are used with it as auxiliary signals for score adjustment of the speech recognition candidates. Applying the logical judgment sequence of first analyzing and confirming the basic signal (the speech signal) and then using the auxiliary signals for auxiliary judgment improves both the stability of terminal speech recognition and the comfort of operation.
In summary, the speech recognition processing method and apparatus provided by the present invention use lip recognition, face recognition, gesture recognition, throat vibration recognition and the like as auxiliary signals on top of speech recognition as the basic signal. This solves the problem of poor user experience caused by low speech recognition rates in the related art; each technique contributes the strengths of its own field and compensates for the weaknesses of the others, the modules being relatively independent yet fused, and the recognition rate of speech processing is greatly improved.
In another embodiment, software is also provided, which executes the technical solutions described in the above embodiments and preferred implementations.
In another embodiment, a storage medium is also provided in which the above software is stored; the storage medium includes but is not limited to an optical disc, a floppy disk, a hard disk, a rewritable memory and the like.
Furthermore, an embodiment of the present invention also describes a computer storage medium in which computer-executable instructions are stored, the computer-executable instructions being configured to execute the speech recognition method shown in Fig. 1.
In the several embodiments provided by the present invention, it should be understood that the disclosed device and method may be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division into units is only a logical functional division, and other divisions are possible in actual implementation: several units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be indirect coupling or communication connection through interfaces, devices or units, and may be electrical, mechanical or of other forms.
The units described above as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units; they may be located in one place or distributed over several network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may all be integrated in one processing unit, or each unit may stand alone as a unit, or two or more units may be integrated in one unit; the integrated unit may be implemented in hardware, or in hardware plus software functional units.
A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by program instructions directing the relevant hardware. The aforementioned program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments; the aforementioned storage medium includes removable storage devices, random access memory (RAM), read-only memory (ROM), magnetic discs, optical discs, and other media capable of storing program code.
Alternatively, if the above integrated unit of the present invention is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the embodiments of the present invention, in essence or in the part contributing to the related art, may be embodied in the form of a software product: the computer software product is stored in a storage medium and includes instructions causing a computer device (which may be a personal computer, a server, a network device or the like) to execute all or part of the methods described in the embodiments of the present invention. The aforementioned storage medium includes removable storage devices, RAM, ROM, magnetic discs, optical discs, and other media capable of storing program code.
Obviously, a person skilled in the art should understand that the modules or steps of the present invention described above may be implemented with a general-purpose computing device; they may be concentrated on a single computing device or distributed over a network of several computing devices. Optionally, they may be implemented with program code executable by computing devices, so that they can be stored in a storage device and executed by a computing device; in some cases the steps shown or described may be executed in an order different from the one here, or they may be made into individual integrated-circuit modules, or several of their modules or steps may be made into a single integrated-circuit module. The present invention is thus not restricted to any particular combination of hardware and software.
The above are only preferred embodiments of the present invention and are not intended to limit it; for a person skilled in the art, the present invention admits various alterations and changes. Any modification, equivalent substitution, improvement and the like made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.
As described above, the speech recognition method, apparatus and computer storage medium provided by the embodiments of the present invention have the following beneficial effect: the accuracy of speech recognition that obtains the content of a user's speech from the user's voice alone is effectively improved, thereby improving the accuracy of speech recognition.
Claims (18)
- A speech recognition method, comprising: acquiring speech recognition information of a user's current speech, and acquiring auxiliary recognition information for the speech recognition information based on the user's current state corresponding to the user's current speech; and determining a final recognition result of the user's current speech according to the speech recognition information and the auxiliary recognition information.
- The method according to claim 1, wherein determining the final recognition result of the user's current speech according to the speech recognition information and the auxiliary recognition information comprises: acquiring, according to the speech recognition information, one or more first candidate words corresponding to the user's current speech; acquiring, according to the auxiliary recognition information, a vocabulary category or one or more second candidate words corresponding to the user's current speech; and determining the final recognition result of the user's current speech according to the one or more first candidate words and the vocabulary category, or determining the final recognition result of the user's current speech according to the one or more first candidate words and the one or more second candidate words.
- The method according to claim 2, wherein determining the final recognition result of the user's current speech according to the one or more first candidate words and the vocabulary category comprises: selecting, from the one or more first candidate words, a first specific word that matches the vocabulary category, and taking the first specific word as the final recognition result of the user's current speech.
- The method according to claim 2, wherein determining the final recognition result of the user's current speech according to the one or more first candidate words and the one or more second candidate words comprises: selecting, from the one or more second candidate words, a second specific word with high similarity to the one or more first candidate words, and taking the second specific word as the final recognition result of the user's current speech.
- The method according to claim 1, wherein acquiring the auxiliary recognition information for the speech recognition information based on the user's current state corresponding to the user's current speech comprises: acquiring an image indicating the user's current state; acquiring image feature information from the image; and acquiring, according to the image feature information, a vocabulary category and/or one or more candidate words corresponding to the image feature information, and taking the vocabulary category and/or the one or more candidate words as the auxiliary recognition information.
- The method according to claim 5, wherein acquiring, according to the image feature information, the vocabulary category and/or one or more candidate words corresponding to the image feature information comprises: searching a predetermined image library for the specific image with the highest similarity to the image feature information; and acquiring, according to a preset correspondence between images and vocabulary categories or candidate words, the vocabulary category or one or more candidate words corresponding to the specific image.
- The method according to any one of claims 1 to 6, wherein the user's current state comprises at least one of: the user's lip movement state, the user's throat vibration state, the user's facial movement state, and the user's gesture movement state.
- The method according to any one of claims 1 to 7, wherein, before acquiring the speech recognition information of the user's current speech and acquiring the auxiliary recognition information for the speech recognition information based on the user's current state corresponding to the user's current speech, the method comprises: determining that the accuracy of a final recognition result of the user's current speech determined from the speech recognition information is below a predetermined threshold.
- A speech recognition apparatus, comprising: an acquisition module configured to acquire speech recognition information of a user's current speech and to acquire auxiliary recognition information for the speech recognition information based on the user's current state corresponding to the user's current speech; and a determination module configured to determine a final recognition result of the user's current speech according to the speech recognition information and the auxiliary recognition information.
- The apparatus according to claim 9, wherein the determination module comprises: a first acquisition unit configured to acquire, according to the speech recognition information, one or more first candidate words corresponding to the user's current speech; a second acquisition unit configured to acquire, according to the auxiliary recognition information, a vocabulary category or one or more second candidate words corresponding to the user's current speech; and a determination unit configured to determine the final recognition result of the user's current speech according to the one or more first candidate words and the vocabulary category, or to determine the final recognition result of the user's current speech according to the one or more first candidate words and the one or more second candidate words.
- The apparatus according to claim 10, wherein the determination unit is also configured to select, from the one or more first candidate words, a first specific word that matches the vocabulary category, and to take the first specific word as the final recognition result of the user's current speech.
- The apparatus according to claim 10, wherein the determination unit is also configured to select, from the one or more second candidate words, a second specific word with high similarity to the one or more first candidate words, and to take the second specific word as the final recognition result of the user's current speech.
- The apparatus according to claim 9, wherein the acquisition module also comprises: a third acquisition unit configured to acquire an image indicating the user's current state; a fourth acquisition unit configured to acquire image feature information from the image; and a fifth acquisition unit configured to acquire, according to the image feature information, a vocabulary category and/or one or more candidate words corresponding to the image feature information, and to take the vocabulary category and/or the one or more candidate words as the auxiliary recognition information.
- The apparatus according to claim 13, wherein the fifth acquisition unit also comprises: a search sub-unit configured to search a predetermined image library for the specific image with the highest similarity to the image feature information; and an acquisition sub-unit configured to acquire, according to a preset correspondence between images and vocabulary categories or candidate words, the vocabulary category or one or more candidate words corresponding to the specific image.
- The apparatus according to any one of claims 9 to 14, wherein the user's current state comprises at least one of: the user's lip movement state, the user's throat vibration state, the user's facial movement state, and the user's gesture movement state.
- The apparatus according to any one of claims 9 to 15, wherein the apparatus also comprises: a judgment module configured to determine that the accuracy of a final recognition result of the user's current speech determined from the speech recognition information is below a predetermined threshold.
- A terminal, comprising a processor, wherein the processor is configured to acquire speech recognition information of a user's current speech, to acquire auxiliary recognition information for the speech recognition information based on the user's current state corresponding to the user's current speech, and to determine a final recognition result of the user's current speech according to the speech recognition information and the auxiliary recognition information.
- A computer storage medium storing computer-executable instructions, the computer-executable instructions being used for the speech recognition method according to any one of claims 1 to 8.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510130636.2 | 2015-03-24 | ||
CN201510130636.2A CN106157956A (zh) | 2015-03-24 | 2015-03-24 | Speech recognition method and device
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016150001A1 true WO2016150001A1 (zh) | 2016-09-29 |
Family
ID=56976870
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2015/079317 WO2016150001A1 (zh) | 2015-03-24 | 2015-05-19 | Speech recognition method and apparatus, and computer storage medium
Country Status (2)
Country | Link |
---|---|
CN (1) | CN106157956A (zh) |
WO (1) | WO2016150001A1 (zh) |
Also Published As
Publication number | Publication date |
---|---|
CN106157956A (zh) | 2016-11-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 15885939; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: PCT application non-entry in European phase | Ref document number: 15885939; Country of ref document: EP; Kind code of ref document: A1 |