Disclosure of Invention
In view of this, the present disclosure provides a speech recognition method to improve the accuracy of speech recognition results.
The present application provides a speech recognition method, which comprises the following steps:
in a preset time period, when the voice function of a deployed device is enabled, starting a voice acquisition module and a video data acquisition module simultaneously to acquire audio data and video data having the same time attribute;
marking, using a four-point marking method, the two mouth corners and the midpoints of the upper and lower lips captured in the video data;
measuring, for each change of mouth shape in the video data, the mouth-corner opening angle, the distance between the upper and lower lips, and the distance between the two mouth corners, to obtain multiple groups of measurement data, wherein the mouth-corner opening angle comprises a first included angle and a second included angle, the first included angle being the angle between the line from the upper lip to a mouth corner and the line between the two mouth corners, and the second included angle being the angle between the line from the lower lip to a mouth corner and the line between the two mouth corners;
inputting the mouth-corner opening angle and the two distances in each group of measurement data as input parameters into an evaluation function to obtain lip change data, wherein the evaluation function is: Pre = k × (p × Angle(a, b) + q × Line(LAF, LDE)), where k, p and q are the weight coefficients of the respective cost functions, Angle(a, b) is the cost function of the first and second included angles, and Line(LAF, LDE) is the cost function of the two distances;
comparing the lip change data with a lip database to obtain a lip language recognition result, and performing audio recognition on the audio data to obtain a speech recognition result;
determining, according to the lip language recognition result, the accuracy of the speech recognition result, the proportion of incorrectly recognized words in each segment of the whole speech, and the number of consecutive incorrectly recognized words in the speech recognition result;
performing a weighted summation of the accuracy, the proportion and the number to obtain a speech recognition characteristic value representing the speech recognition characteristics of the deployed device;
after the speech recognition characteristic values of the deployed devices are obtained, calculating, for each deployed device, a first difference between the speech recognition characteristic value of each other deployed device and that of the deployed device;
sorting the first differences in a specified order to obtain a similarity ranking that characterizes how similarly the deployed device and the other deployed devices are affected by ambient sound and noise;
after the similarity ranking corresponding to the deployed device is obtained, when the deployed device enables the voice function again, selecting a target deployed device from the other deployed devices according to the similarity ranking, and starting the voice acquisition module of the target deployed device to acquire a target audio signal, wherein the target deployed device is the highest-ranked deployed device not currently being used by a user when the deployed device enables the voice function again;
performing, according to the audio features of the target audio signal, audio separation on the audio signal acquired by the deployed device to obtain the user audio acquired when the deployed device enables the voice function again;
and performing speech recognition on the user audio to obtain the user's voice instruction.
Optionally, after the lip language recognition result and the speech recognition result are obtained, the similarity ranking may also be obtained by:
determining, according to the lip language recognition result, the accuracy of the speech recognition result, the proportion of incorrectly recognized words in each segment of the whole speech, and the number of consecutive incorrectly recognized words in the speech recognition result; calculating, for each deployed device, a first similarity between the accuracy of the deployed device and that of each other deployed device, a second similarity between the proportions, and a third similarity between the numbers of consecutive incorrect words; performing a weighted summation of the first, second and third similarities corresponding to the deployed device to obtain similarities characterizing how the deployed device and the other deployed devices are affected by ambient sound and noise; and sorting the similarities in a specified order to obtain a similarity ranking that characterizes how similarly the deployed device and the other deployed devices are affected by ambient sound and noise; or,
calculating the text similarity between the text corresponding to the lip language recognition result and the text corresponding to the speech recognition result to obtain the text similarity corresponding to the deployed device; calculating, for each deployed device, a second difference between the text similarity of the deployed device and that of each other deployed device; and sorting the second differences in a specified order to obtain a similarity ranking that characterizes how similarly the deployed device and the other deployed devices are affected by ambient sound and noise.
Optionally, performing, according to the audio features of the target audio signal, audio separation on the audio signal acquired by the deployed device to obtain the user audio acquired when the deployed device enables the voice function again comprises:
aligning audio frames of the target audio signal and of the audio signal acquired by the deployed device that have the same acquisition time, to obtain a plurality of audio frame groups with the same audio acquisition time;
for each audio frame group, performing audio separation on the audio frame of the audio signal acquired by the deployed device using the audio features of the corresponding audio frame of the target audio signal, to obtain the user audio frame, corresponding to that group, acquired when the deployed device enables the voice function again;
and splicing the user audio frames corresponding to the plurality of audio frame groups to obtain the user audio acquired when the deployed device enables the voice function again.
Since the ambient sound and noise may change over time — for example, they are higher during morning and evening rush hours than at other times — audio data needs to be collected periodically in this application so as to determine, for the corresponding time period, which deployed devices have similar ambient sound and noise. To do so, a lip language recognition result and a speech recognition result are determined for each device in the current time period. Because the lip language recognition result is not affected by ambient sound and noise, it can be used to determine the accuracy of the speech recognition result, the proportion of incorrectly recognized words in each segment of the whole speech, and the number of consecutive incorrectly recognized words in the speech recognition result. The accuracy characterizes how strongly the ambient sound and noise affect the speech recognition result; the proportion of incorrectly recognized words in each segment characterizes how the recognition errors are distributed over the whole speech; and the number of consecutive incorrect words characterizes the occurrence of sudden bursts of ambient sound and noise. These three parameters are used to determine a first difference between the speech recognition characteristic value of each other deployed device and that of the deployed device, which characterizes how similar the environments of the two devices are. The first differences are then sorted, and the resulting order is the similarity ranking of how the deployed device and the other deployed devices are affected by ambient sound and noise. When a deployed device enables the voice function again, in order to reduce the influence of ambient sound and noise on it, the voice acquisition module of the target deployed device — the highest-ranked device not currently used by a user — is started at the same time. At this moment the deployed device acquires mixed audio, while the target deployed device acquires mixed audio consisting only of ambient sound and noise. Audio separation is then performed on the audio signal acquired by the deployed device using the audio features of the mixed audio from the target deployed device, yielding the user audio acquired by the deployed device. Since the influence of ambient sound and noise is reduced in the resulting user audio, the accuracy of the speech recognition result is improved, and so is the accuracy of the obtained voice instruction.
In order to make the aforementioned objects, features and advantages of the present application comprehensible, preferred embodiments accompanied with figures are described in detail below.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
The following embodiments can be applied to personal health management to help people live healthier lives. The present application is described in detail below.
Fig. 1 is a schematic flowchart of a speech recognition method according to an embodiment of the present application. As shown in Fig. 1, the speech recognition method includes the following steps.
Step 101: within a preset time period, when the voice function of a deployed device is enabled, simultaneously start the voice acquisition module and the video data acquisition module to acquire audio data and video data having the same time attribute.
Specifically, the deployed devices include blood pressure detection devices, heart rate detection devices, or multifunctional integrated detection devices, and a plurality of them are deployed at different locations so as to cover a certain area. Since the ambient sound and noise may change over time — for example, they are higher during morning and evening rush hours than at other times — audio data needs to be collected periodically so as to determine, for the corresponding time period, how the ambient sound and noise affect the speech recognition result. To do so, the voice acquisition module and the video data acquisition module are started simultaneously to acquire audio data and video data with the same time attribute. Because the lip language in the video data is not affected by ambient sound and noise, the degree to which the ambient sound and noise affect the speech recognition result can be determined from the video data.
Step 102: mark, using a four-point marking method, the two mouth corners and the midpoints of the upper and lower lips captured in the video data.
Specifically, different pronunciations correspond to different mouth shapes, where the differences include the mouth-corner opening angle, the distance between the upper and lower lips, and the distance between the two mouth corners. To obtain these data, the two mouth corners and the midpoints of the upper and lower lips captured in the video data need to be labeled, so that the lip language recognition result can be obtained from the data derived from the lines connecting the labeled points.
It should be noted that other numbers of label points may be used. For example, in a six-point labeling method, the six points are points A and F at the two mouth corners, points B and G at the midpoints of the left and right halves of the upper lip, and points C and H at the midpoints of the left and right halves of the lower lip. The specific labeling method can be chosen according to the precision actually required and is not described in detail herein.
Step 103: measure, for each change of mouth shape in the video data, the mouth-corner opening angle, the distance between the upper and lower lips, and the distance between the two mouth corners, to obtain multiple groups of measurement data, wherein the mouth-corner opening angle comprises a first included angle and a second included angle, the first included angle being the angle between the line from the upper lip to a mouth corner and the line between the two mouth corners, and the second included angle being the angle between the line from the lower lip to a mouth corner and the line between the two mouth corners.
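As an illustration only, the measurement of step 103 from the labeled points might be sketched as follows. All names are hypothetical, the points are assumed to be (x, y) pixel coordinates, and the angles are assumed to be measured at one mouth corner, a detail the application does not fix:

```python
import math

def measure(upper, lower, corner_a, corner_f):
    # upper/lower: midpoints of the upper and lower lips; corner_a/corner_f:
    # the two mouth corners, all (x, y) coordinates from the four-point labeling.
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def included_angle(p, corner, other_corner):
        # Angle at `corner` between the line corner->p and the line
        # between the two mouth corners.
        v1 = (p[0] - corner[0], p[1] - corner[1])
        v2 = (other_corner[0] - corner[0], other_corner[1] - corner[1])
        cos_t = ((v1[0] * v2[0] + v1[1] * v2[1])
                 / (math.hypot(*v1) * math.hypot(*v2)))
        return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

    a = included_angle(upper, corner_a, corner_f)   # first included angle
    b = included_angle(lower, corner_a, corner_f)   # second included angle
    l_af = dist(corner_a, corner_f)                 # mouth-corner distance LAF
    l_de = dist(upper, lower)                       # upper/lower-lip distance LDE
    return a, b, l_af, l_de
```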
Step 104: input the mouth-corner opening angle and the two distances in each group of measurement data as input parameters into an evaluation function to obtain lip change data, wherein the evaluation function is: Pre = k × (p × Angle(a, b) + q × Line(LAF, LDE)), where k, p and q are the weight coefficients of the respective cost functions, Angle(a, b) is the cost function of the first and second included angles, and Line(LAF, LDE) is the cost function of the two distances.
It should be noted that LAF is the distance between the two mouth corners and LDE is the distance between the upper and lower lips. Since pronunciation habits differ between regions, p, q and k can be set according to the area in which the device is deployed.
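A minimal sketch of the evaluation function follows, assuming simple additive forms for the two cost functions, which the application leaves unspecified; the default weights are likewise illustrative:

```python
def angle_cost(a, b):
    # Hypothetical cost over the first and second included angles;
    # the application does not fix the form of Angle(a, b).
    return a + b

def line_cost(l_af, l_de):
    # Hypothetical cost over the mouth-corner distance LAF and the
    # upper/lower-lip distance LDE; Line(LAF, LDE) is likewise unspecified.
    return l_af + l_de

def evaluate(a, b, l_af, l_de, k=1.0, p=0.5, q=0.5):
    # Pre = k * (p * Angle(a, b) + q * Line(LAF, LDE)); k, p, q are
    # weight coefficients, settable per deployment area.
    return k * (p * angle_cost(a, b) + q * line_cost(l_af, l_de))
```

Running such a function over every group of measurement data yields the lip change data that is compared against the lip database in the next step.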
Step 105: compare the lip change data with a lip database to obtain a lip language recognition result, and perform audio recognition on the audio data to obtain a speech recognition result.
Step 106: determine, according to the lip language recognition result, the accuracy of the speech recognition result, the proportion of incorrectly recognized words in each segment of the whole speech, and the number of consecutive incorrectly recognized words in the speech recognition result.
Specifically, the accuracy of the speech recognition result characterizes how strongly the ambient sound and noise affect the whole speech recognition result; the proportion of incorrectly recognized words in each segment of the whole speech characterizes how the recognition errors are distributed over the whole speech; and the number of consecutive incorrect words characterizes the occurrence of sudden bursts of ambient sound and noise. Together, these three parameters give a general picture of the ambient sound and noise.
Step 107: perform a weighted summation of the accuracy, the proportion and the number to obtain a speech recognition characteristic value representing the speech recognition characteristics of the deployed device.
Specifically, since the different data affect the accuracy of the speech recognition result to different degrees, different weights need to be assigned to the three parameters before the weighted summation, and the result is used as the speech recognition characteristic value of the deployed device.
It should be noted that the specific weights of the three parameters may be set according to actual needs and are not specifically limited herein.
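As a sketch, the weighted summation of step 107 might look as follows; the weights, and the reduction of the per-segment error proportions to their mean, are illustrative assumptions not fixed by the application:

```python
def recognition_feature_value(accuracy, segment_error_ratios, consecutive_errors,
                              w_acc=0.5, w_ratio=0.3, w_cons=0.2):
    # Illustrative weights; the application leaves weight assignment open.
    # Reduce the per-segment error proportions to a single number (an assumption).
    mean_ratio = sum(segment_error_ratios) / len(segment_error_ratios)
    return w_acc * accuracy + w_ratio * mean_ratio + w_cons * consecutive_errors
```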
Step 108: after the speech recognition characteristic values of the deployed devices are obtained, calculate, for each deployed device, a first difference between the speech recognition characteristic value of each other deployed device and that of the deployed device.
Specifically, the first difference characterizes the similarity between the environments of two deployed devices: the smaller the first difference, the more similar the environments.
Step 109: sort the first differences in a specified order to obtain a similarity ranking that characterizes how similarly the deployed device and the other deployed devices are affected by ambient sound and noise.
Specifically, after the first differences are sorted, the similarity between the environment of a given deployed device and those of the other deployed devices is known, so the ranking of how similarly they are affected by ambient sound and noise can be determined from the sorted order. For example, suppose the deployed devices are device 1, device 2, device 3 and device 4. Taking device 1 as an example, after the differences between device 1 and device 2, between device 1 and device 3, and between device 1 and device 4 are obtained, sorting these three differences yields the similarity ranking of devices 2, 3 and 4 with respect to the environment of device 1.
It should be noted that different deployed devices may yield different similarity rankings. For example, the device most similar to device 1's environment may be device 2, while the device most similar to device 2's environment may be device 3.
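A minimal sketch of steps 108 and 109, using absolute differences and ascending order (both assumptions; the application only specifies "a specified order"):

```python
def similarity_ranking(feature_values, device_id):
    # feature_values: mapping from device id to its speech recognition
    # characteristic value (an illustrative data structure).
    own = feature_values[device_id]
    diffs = {other: abs(value - own)
             for other, value in feature_values.items() if other != device_id}
    # A smaller first difference means a more similar environment,
    # so sort in ascending order.
    return sorted(diffs, key=diffs.get)

# Example for device 1 among devices 1-4 (values are made up):
# similarity_ranking({1: 0.82, 2: 0.80, 3: 0.65, 4: 0.40}, 1) == [2, 3, 4]
```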
Step 110: after the similarity ranking corresponding to the deployed device is obtained, when the deployed device enables the voice function again, select a target deployed device from the other deployed devices according to the similarity ranking, and start the voice acquisition module of the target deployed device to acquire a target audio signal, wherein the target deployed device is the highest-ranked deployed device not currently being used by a user when the deployed device enables the voice function again.
Specifically, after the similarity ranking corresponding to each deployed device is obtained, when a given device is used again and its voice function is enabled, the idle device whose environment is most similar to that device's must be started. For example, suppose device 1 enables the voice function again and the devices ranked from most to least similar to its environment are device 2, device 3 and device 4. It is first checked whether device 2 is in use; if not, the voice acquisition module of device 2 is started together with the voice function of device 1, so that device 2 can capture the ambient sound and noise. If device 2 is currently in use, it is checked whether device 3 is in use; if not, the voice acquisition module of device 3 is started instead, and so on, until the target deployed device is obtained. This selection logic is sketched below.
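A minimal sketch of the selection, assuming some means of tracking which devices are in use (a detail the application does not specify):

```python
def select_target_device(ranking, devices_in_use):
    # ranking: device ids ordered from most to least similar environment;
    # devices_in_use: set of ids currently being used by users.
    for device_id in ranking:
        if device_id not in devices_in_use:
            return device_id  # highest-ranked idle device
    return None  # no idle device is available

# With ranking [2, 3, 4] and device 2 in use, device 3 is selected.
```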
Step 111: perform, according to the audio features of the target audio signal, audio separation on the audio signal acquired by the deployed device to obtain the user audio acquired when the deployed device enables the voice function again.
Specifically, after the voice acquisition module of the target deployed device has acquired a target audio signal consisting of the ambient sound and noise, the audio features of the target audio signal can be used to perform audio separation on the audio signal acquired by the deployed device, yielding the user audio captured by the deployed device.
The specific separation method may be set according to actual needs and is not specifically limited herein.
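Since the application leaves the separation method open, the following is just one possible sketch, using frame-wise spectral subtraction of the target (ambient) signal from the mixed signal; the signal layout and sample-rate match are assumptions:

```python
import numpy as np

def spectral_subtract(mixed, ambient, frame_len=512):
    # mixed: audio captured by the deployed device (user voice + ambience);
    # ambient: target audio signal from the target deployed device.
    # Both are assumed to be 1-D float arrays at the same sample rate.
    n = min(len(mixed), len(ambient)) // frame_len * frame_len
    out = np.empty(n)
    for start in range(0, n, frame_len):
        m = np.fft.rfft(mixed[start:start + frame_len])
        a = np.fft.rfft(ambient[start:start + frame_len])
        # Subtract the ambient magnitude spectrum, keeping the mixed phase.
        mag = np.maximum(np.abs(m) - np.abs(a), 0.0)
        out[start:start + frame_len] = np.fft.irfft(
            mag * np.exp(1j * np.angle(m)), frame_len)
    return out
```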
Step 112: perform speech recognition on the user audio to obtain the user's voice instruction.
Specifically, after speech recognition, the text corresponding to the user audio is obtained, and the user's instruction is then determined through semantic analysis. For example, when the recognized text is "I want to measure my blood pressure", semantic analysis determines that the user wants to start the blood pressure detection function, and the blood pressure detection module can then be started.
Of course, after the function the user wants to start has been determined, a prompt asking whether to start the corresponding function may also be shown on the display interface, and the corresponding module is started once the user confirms. Assisting patients with health detection by voice in this way reduces the difficulty patients face in operating the device.
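A minimal sketch of the kind of keyword-based intent mapping described above; the application does not prescribe a semantic-analysis method, and all names here are illustrative:

```python
# Hypothetical mapping from recognized keywords to detection modules.
INTENT_KEYWORDS = {
    "blood pressure": "blood_pressure_module",
    "heart rate": "heart_rate_module",
}

def resolve_instruction(text):
    for keyword, module in INTENT_KEYWORDS.items():
        if keyword in text.lower():
            return module  # module to prompt for and start on confirmation
    return None

# resolve_instruction("I want to measure my blood pressure")
# -> "blood_pressure_module"
```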
In this application, the influence of ambient sound and noise on the user audio is reduced before speech recognition is performed, which helps improve the accuracy of the speech recognition result and thus the accuracy of the obtained voice instruction.
Meanwhile, compared with recognizing the user's lip language from video data, speech recognition has the advantages of higher recognition speed and lower recognition cost.
In a possible embodiment, after the lip language recognition result and the speech recognition result are obtained, the similarity ranking may also be obtained in either of the following two ways.
the first method is as follows: determining the accuracy of the voice recognition result, the ratio of the wrongly recognized characters in the voice recognition result in each division area of the whole voice and the quantity of continuous wrongly recognized characters in the voice recognition result according to the lip language recognition result; for each throwing device, calculating a first similarity of the accuracy rate of the throwing device and other throwing devices, a second similarity of the occupation ratio and a third similarity of the number of continuous wrong characters; weighting and summing the first similarity, the second similarity and the third similarity corresponding to the throwing device to obtain similarities which are used for representing that the throwing device and other throwing devices are influenced by environment and noise; and sequencing the similarity according to a specified sequence to obtain a similarity sequence for representing the similar situation of the delivering device and other delivering devices influenced by the surrounding environment and noise.
The second method comprises the following steps: performing text similarity calculation on the text information corresponding to the lip language recognition result and the text information corresponding to the voice recognition result to obtain text similarity corresponding to the releasing equipment; for each delivery device, calculating a second difference value between the text similarity of the delivery device and the text similarity of other delivery devices; and sorting the second difference values according to a specified sequence to obtain a similarity sorting which is used for representing the similar situation of the projection device and other projection devices influenced by the surrounding environment and noise.
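A sketch of method two; SequenceMatcher is only one possible text-similarity measure, and the absolute differences and ascending order are assumptions:

```python
from difflib import SequenceMatcher

def text_similarity(lip_text, speech_text):
    # One possible similarity measure; the application does not fix one.
    return SequenceMatcher(None, lip_text, speech_text).ratio()

def second_difference_ranking(similarities, device_id):
    # similarities: mapping from device id to its text similarity score.
    own = similarities[device_id]
    diffs = {other: abs(score - own)
             for other, score in similarities.items() if other != device_id}
    return sorted(diffs, key=diffs.get)
```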
In a possible implementation, Fig. 2 is a schematic flowchart of another speech recognition method provided in an embodiment of the present application. As shown in Fig. 2, step 111 may be implemented through the following steps.
Step 201: align audio frames of the target audio signal and of the audio signal acquired by the deployed device that have the same acquisition time, to obtain a plurality of audio frame groups with the same audio acquisition time.
Step 202: for each audio frame group, perform audio separation on the audio frame of the audio signal acquired by the deployed device using the audio features of the corresponding audio frame of the target audio signal, to obtain the user audio frame, corresponding to that group, acquired when the deployed device enables the voice function again.
Step 203: splice the user audio frames corresponding to the plurality of audio frame groups to obtain the user audio acquired when the deployed device enables the voice function again.
Specifically, the target audio signal and the audio signal acquired by the deployed device have the same recording start time, so the audio captured by the two devices can be aligned in time, and audio frames with the same timestamp are placed in the same group. For frames in the same group, the environments of the two corresponding devices at that moment are relatively similar. After alignment, the audio frames of the signal acquired by the deployed device can be separated using the corresponding frames of the target audio signal, and the audio frame groups can be processed in parallel during separation to increase the data processing speed.
It should be noted that the splicing method used for the audio may be set according to actual needs and is not specifically limited herein.
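Under the same assumptions as the spectral-subtraction sketch above, the frame grouping and splicing of steps 201 to 203 might look as follows; the per-group calls are independent and could be parallelized as noted, and any other separation or splicing method could be substituted:

```python
import numpy as np

def separate_user_audio(device_audio, target_audio, frame_len=512):
    # Both signals share a recording start time, so frames at the same
    # offset form one audio frame group (step 201).
    n = min(len(device_audio), len(target_audio)) // frame_len * frame_len
    user_frames = [
        # Step 202: separate each group, reusing the spectral_subtract
        # sketch above (an assumed choice of separation method).
        spectral_subtract(device_audio[s:s + frame_len],
                          target_audio[s:s + frame_len], frame_len)
        for s in range(0, n, frame_len)
    ]
    # Step 203: splice the user audio frames back together.
    return np.concatenate(user_frames) if user_frames else np.array([])
```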
It should be noted that like reference numbers and letters refer to like items in the following figures; thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Moreover, the terms "first", "second", "third", etc. are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above embodiments are only specific embodiments of the present application, used to illustrate rather than limit its technical solutions, and the protection scope of the present application is not limited to them. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art may still modify the technical solutions described in the foregoing embodiments, easily conceive of changes to them, or make equivalent substitutions for some of their technical features within the technical scope disclosed in the present application. Such modifications, changes or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.