
CN114420124B - Speech recognition method - Google Patents

Speech recognition method Download PDF

Info

Publication number
CN114420124B
CN114420124B (application CN202210328042.2A)
Authority
CN
China
Prior art keywords
audio
voice
similarity
recognition result
voice recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210328042.2A
Other languages
Chinese (zh)
Other versions
CN114420124A (en)
Inventor
赵进
刘邦长
赵红文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Miaoyijia Health Technology Group Co., Ltd.
Original Assignee
Beijing Miaoyijia Health Technology Group Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Miaoyijia Health Technology Group Co., Ltd.
Priority to CN202210328042.2A
Publication of CN114420124A
Application granted
Publication of CN114420124B
Legal status: Active


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/10: Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223: Execution procedure of a spoken command
    • G10L15/24: Speech recognition using non-acoustical features
    • G10L15/25: Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272: Voice signal separating

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application provides a speech recognition method. After the deployed devices whose surroundings are acoustically similar at the current moment have been identified, two devices are started simultaneously whenever speech recognition is performed: one collects the audio signal containing the user's speech, while the other collects the mixed audio of ambient sound and noise. The mixed audio is then used to separate the user's speech from the first signal, yielding user audio that is largely free of environmental interference, and speech recognition is performed on that audio to obtain the user's voice command.

Description

Speech recognition method
Technical Field
The application relates to the technical field of speech recognition, and in particular to a speech recognition method.
Background
With the development of technology, speech recognition is being applied in more and more fields to assist users in operating devices.
Medical detection devices deployed in the field are equipped with a speech recognition function so that patients can use them on a self-service basis: a patient enables a function on the device through a voice command. For example, the patient may say "I want to measure my blood pressure"; the device transcribes the speech into text, determines from the semantics of the text which function the patient wants, and then activates the blood pressure detection function.
When such deployed medical detection devices perform speech recognition, however, they are affected by surrounding sound and noise, for example the sound of vehicles driving on the street, the voices of people nearby, or electromagnetic noise in the surrounding environment. This ambient sound and noise degrades recognition and makes the speech recognition result inaccurate.
Disclosure of Invention
In view of this, the present application provides a speech recognition method to improve the accuracy of speech recognition results.
The application provides a speech recognition method comprising the following steps:
within a preset time period, when the voice function of a deployed device is enabled, starting a voice acquisition module and a video data acquisition module simultaneously to collect audio data and video data carrying the same time attribute;
labeling, by a four-point labeling method, the two mouth corners and the midpoints of the upper and lower lips captured in the video data;
each time the mouth shape changes in the video data, measuring the mouth-corner opening angles, the lip spacing between the upper and lower lips and the distance between the two mouth corners to obtain multiple groups of measurement data, wherein the mouth-corner opening angles comprise a first included angle, between the line from the upper lip to a mouth corner and the line joining the two mouth corners, and a second included angle, between the line from the lower lip to that mouth corner and the line joining the two mouth corners;
feeding the mouth-corner opening angles, the lip spacing and the mouth-corner distance of each group of measurement data into an evaluation function to obtain lip change data, the evaluation function being Pre = k × (p × Angle(a, b) + q × Line(LAF, LDE)), where k, p and q are the weight coefficients of the cost functions, Angle(a, b) is the cost function of the first and second included angles, and Line(LAF, LDE) is the cost function of the mouth-corner distance and the lip spacing;
comparing the lip change data with a lip database to obtain a lip-reading recognition result, and performing audio recognition on the audio data to obtain a speech recognition result;
determining, from the lip-reading recognition result, the accuracy of the speech recognition result, the proportion of misrecognized characters falling in each partition of the whole utterance and the number of consecutive misrecognized characters in the speech recognition result;
weighting and summing the accuracy, the proportions and the number to obtain a speech recognition characteristic value representing the speech recognition characteristics of the deployed device;
after the speech recognition characteristic values of the deployed devices are obtained, calculating, for each deployed device, the first differences between its characteristic value and those of the other deployed devices;
sorting the first differences in a specified order to obtain a similarity ranking representing how similarly the deployed device and the other deployed devices are affected by ambient sound and noise;
after the similarity ranking of the deployed device is obtained, when the device enables its voice function again, selecting a target device from the other deployed devices according to the ranking and starting the target device's voice acquisition module to collect a target audio signal, the target device being the highest-ranked deployed device not in use by any user at that moment;
performing audio separation on the audio signal collected by the deployed device according to the audio features of the target audio signal, to obtain the user audio collected when the device enabled the voice function again;
and performing speech recognition on the user audio to obtain the user's voice command.
Optionally, after the lip-reading recognition result and the speech recognition result are obtained, the similarity ranking may instead be obtained in either of the following ways:
determining, from the lip-reading recognition result, the accuracy of the speech recognition result, the proportion of misrecognized characters in each partition of the whole utterance and the number of consecutive misrecognized characters; calculating, for each deployed device, a first similarity between its accuracy and that of each other deployed device, a second similarity between the proportions and a third similarity between the numbers of consecutive misrecognized characters; weighting and summing the first, second and third similarities corresponding to the deployed device to obtain similarities representing how the device and the other devices are affected by ambient sound and noise; and sorting the similarities in a specified order to obtain the similarity ranking; or,
computing the text similarity between the text of the lip-reading recognition result and the text of the speech recognition result to obtain the text similarity of the deployed device; calculating, for each deployed device, the second differences between its text similarity and those of the other deployed devices; and sorting the second differences in a specified order to obtain the similarity ranking.
Optionally, performing audio separation on the audio signal collected by the deployed device according to the audio features of the target audio signal, to obtain the user audio collected when the device enabled the voice function again, includes:
aligning the target audio signal with the frames of the audio signal collected by the deployed device that share the same acquisition time, to obtain a plurality of audio frame groups with the same acquisition time;
for each audio frame group, using the audio features of the frame belonging to the target audio signal to perform audio separation on the frame belonging to the signal collected by the deployed device, to obtain the user audio frame for that group;
and splicing the user audio frames of the plurality of groups to obtain the user audio collected when the deployed device enabled the voice function again.
Ambient sound and noise change over time; during the morning and evening rush hours, for example, they are louder than at other times. Audio data is therefore collected periodically so that, for each time period, the deployed devices with similar ambient sound and noise can be identified. To identify them, the lip-reading recognition result and the speech recognition result of each device within the current period are determined. Because lip reading is unaffected by ambient sound and noise, the lip-reading result can be used to determine the accuracy of the speech recognition result, the proportion of misrecognized characters in each partition of the whole utterance, and the number of consecutive misrecognized characters. The accuracy characterizes how strongly ambient sound and noise affect the recognition result; the per-partition proportions characterize how the misrecognized characters are distributed across the utterance; and the number of consecutive misrecognized characters characterizes sudden bursts of ambient sound and noise.

From these three parameters, the first difference between the speech recognition characteristic value of each other deployed device and that of the device in question is determined; it characterizes how similar the environments of the two devices are. Sorting the first differences yields the similarity ranking of the device against the other devices with respect to ambient sound and noise. When a device's voice function is enabled again, to reduce the influence of ambient sound and noise, the voice acquisition module of the highest-ranked unused device (the target device) is started at the same time: the device collects mixed audio containing the user's speech, while the target device collects the mixed audio formed purely by ambient sound and noise. The audio features of the target device's mixed audio are then used to separate the audio signal collected by the device, yielding user audio in which the influence of ambient sound and noise is reduced, thereby improving the accuracy of the speech recognition result and of the voice command obtained from it.
In order to make the aforementioned objects, features and advantages of the present application comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the application and should therefore not be regarded as limiting its scope; those skilled in the art can derive other related drawings from them without inventive effort.
Fig. 1 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of another speech recognition method according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. The described embodiments are only a part of the embodiments of the application, not all of them. The components of the embodiments, as generally described and illustrated in the figures, could be arranged and designed in a wide variety of configurations; the following detailed description is therefore not intended to limit the scope of the claimed application but merely represents selected embodiments. All other embodiments obtained by those skilled in the art from the embodiments of the present application without creative effort fall within its protection scope.
The following embodiments can be applied to personal health management, helping people live healthier lives. The application is described in detail below.
Fig. 1 is a schematic flowchart of a speech recognition method according to an embodiment of the present application, and as shown in fig. 1, the speech recognition method includes the following steps:
Step 101, within a preset time period, when the voice function of the deployed device is enabled, starting the voice acquisition module and the video data acquisition module simultaneously to collect audio data and video data carrying the same time attribute.
Specifically, the deployed devices include blood pressure detection devices, heart rate detection devices, or multifunctional integrated detection devices; multiple devices are deployed at different locations so as to provide detection coverage of a certain area. Because ambient sound and noise change over time (during the morning and evening rush hours, for example, they are louder than at other times), audio data must be collected periodically to determine how ambient sound and noise affect the speech recognition result within each time period. The voice acquisition module and the video data acquisition module are started simultaneously so that the collected audio and video share the same time attribute; since the lip movements in the video are unaffected by ambient sound and noise, the video data can be used to gauge how strongly the environment affects the speech recognition result.
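As an illustration of step 101, the following Python sketch starts both captures together and stamps every chunk with one shared clock, so the audio and video streams carry the same time attribute. Here capture_audio and capture_video are hypothetical stand-ins for the device's real acquisition modules, which the patent does not specify:

    import threading, time

    def timed_capture(capture_fn, out, stop):
        # Stamp every captured chunk with the shared monotonic clock.
        while not stop.is_set():
            out.append((time.monotonic(), capture_fn()))

    def record_both(capture_audio, capture_video, seconds=1.0):
        audio, video, stop = [], [], threading.Event()
        threads = [threading.Thread(target=timed_capture, args=(fn, buf, stop))
                   for fn, buf in ((capture_audio, audio), (capture_video, video))]
        for t in threads:
            t.start()
        time.sleep(seconds)   # both modules run over the same window
        stop.set()
        for t in threads:
            t.join()
        return audio, video   # two timestamped streams, matchable by time attribute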
Step 102, labeling, by a four-point labeling method, the two mouth corners and the midpoints of the upper and lower lips captured in the video data.
Specifically, different pronunciations correspond to different mouth shapes, which differ in the mouth-corner opening angles, the lip spacing and the mouth-corner distance. To obtain these data, the two mouth corners and the midpoints of the upper and lower lips captured in the video data are labeled, and the lip-reading recognition result is later derived from measurements taken along the lines between the labeled points.
It should be noted that other numbers of labeling points may be used. In a six-point labeling method, for example, points A and F lie at the two mouth corners, points B and G are the midpoints of the left and right halves of the upper lip, and points C and H are the midpoints of the left and right halves of the lower lip. The specific labeling scheme can be chosen according to the precision actually required and is not detailed here.
Step 103, each time the mouth shape changes in the video data, measuring the mouth-corner opening angles, the lip spacing between the upper and lower lips and the distance between the two mouth corners to obtain multiple groups of measurement data, wherein the mouth-corner opening angles comprise a first included angle, between the line from the upper lip to a mouth corner and the line joining the two mouth corners, and a second included angle, between the line from the lower lip to that mouth corner and the line joining the two mouth corners.
Step 104, feeding the mouth-corner opening angles, the lip spacing and the mouth-corner distance of each group of measurement data into the evaluation function to obtain lip change data, the evaluation function being Pre = k × (p × Angle(a, b) + q × Line(LAF, LDE)), where k, p and q are the weight coefficients of the cost functions, Angle(a, b) is the cost function of the first and second included angles, and Line(LAF, LDE) is the cost function of the mouth-corner distance and the lip spacing.
It should be noted that LAF is the distance between the two mouth corners and LDE is the lip spacing between the upper and lower lips. Since pronunciation habits differ between regions, p, q and k can be set according to the area in which the device is deployed.
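The following is a minimal Python sketch of steps 103 and 104. It assumes four labeled points A and F (mouth corners), D (upper-lip midpoint) and E (lower-lip midpoint), a naming suggested by LAF and LDE but not fixed by the patent, and it substitutes simple magnitude sums for the cost functions Angle() and Line(), whose concrete form the patent leaves open:

    import math

    def dist(p1, p2):
        return math.hypot(p1[0] - p2[0], p1[1] - p2[1])

    def included_angle(lip, corner, other_corner):
        # Angle at `corner` between the corner-to-lip line and the corner-to-corner line.
        v1 = (lip[0] - corner[0], lip[1] - corner[1])
        v2 = (other_corner[0] - corner[0], other_corner[1] - corner[1])
        cos_a = (v1[0] * v2[0] + v1[1] * v2[1]) / (math.hypot(*v1) * math.hypot(*v2))
        return math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))

    def evaluate(A, D, E, F, k=1.0, p=0.6, q=0.4):
        a = included_angle(D, A, F)   # first included angle (upper lip)
        b = included_angle(E, A, F)   # second included angle (lower lip)
        LAF = dist(A, F)              # distance between the two mouth corners
        LDE = dist(D, E)              # lip spacing between upper and lower lip
        angle_cost = a + b            # placeholder for Angle(a, b)
        line_cost = LAF + LDE         # placeholder for Line(LAF, LDE)
        return k * (p * angle_cost + q * line_cost)   # Pre = k*(p*Angle + q*Line)

    # One evaluation per mouth-shape change, e.g. for a wide-open mouth:
    print(evaluate(A=(0, 0), D=(25, 18), E=(25, -18), F=(50, 0)))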
Step 105, comparing the lip change data with the lip database to obtain a lip-reading recognition result, and performing audio recognition on the audio data to obtain a speech recognition result.
Step 106, determining, from the lip-reading recognition result, the accuracy of the speech recognition result, the proportion of misrecognized characters in each partition of the whole utterance and the number of consecutive misrecognized characters in the speech recognition result.
Specifically, the accuracy of the speech recognition result characterizes the overall degree to which ambient sound and noise affect recognition; the per-partition proportions of misrecognized characters characterize how the recognition errors are distributed across the utterance; and the number of consecutive misrecognized characters characterizes sudden bursts of ambient sound and noise. Together, these three parameters describe the general sound and noise conditions of the surrounding environment.
Step 107, weighting and summing the accuracy, the proportions and the number to obtain a speech recognition characteristic value representing the speech recognition characteristics of the deployed device.
Specifically, because the three parameters affect the accuracy of the speech recognition result to different degrees, they are assigned different weights and then summed, and the result serves as the speech recognition characteristic value of the deployed device.
It should be noted that the specific weights of the three parameters may be set according to actual needs and are not specifically limited here.
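A minimal sketch of steps 106 and 107, treating the lip-reading text as the reference transcript. It assumes the two results are aligned, equal-length character sequences and that the utterance is split into a fixed number of partitions; the weights are placeholders, since the patent leaves the weighting to the implementer:

    def recognition_features(lip_text, asr_text, n_partitions=4):
        errors = [c1 != c2 for c1, c2 in zip(lip_text, asr_text)]
        accuracy = 1 - sum(errors) / len(errors)

        # Share of the misrecognized characters falling in each partition.
        size = -(-len(errors) // n_partitions)   # ceiling division
        total_err = max(1, sum(errors))
        proportions = [sum(errors[i:i + size]) / total_err
                       for i in range(0, len(errors), size)]

        # Longest run of consecutive misrecognized characters.
        longest = run = 0
        for e in errors:
            run = run + 1 if e else 0
            longest = max(longest, run)
        return accuracy, proportions, longest

    def characteristic_value(lip_text, asr_text, w1=0.5, w2=0.3, w3=0.2):
        acc, props, consec = recognition_features(lip_text, asr_text)
        # The patent only says the three quantities are weighted and summed;
        # the per-partition proportions are collapsed to their maximum here.
        return w1 * acc + w2 * max(props) + w3 * consec

    print(characteristic_value("我想检测血压", "我想检测学鸭"))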
Step 108, after the speech recognition characteristic values of the deployed devices are obtained, calculating, for each deployed device, the first differences between its characteristic value and those of the other deployed devices.
Specifically, the first difference characterizes the similarity between the environments of two deployed devices: the smaller the first difference, the more similar the environments.
Step 109, sorting the first differences in a specified order to obtain a similarity ranking representing how similarly the deployed device and the other deployed devices are affected by ambient sound and noise.
Specifically, once the first differences are sorted, the similarity between a given device's environment and those of the other devices is known, so the ranking can be read off directly. For example, suppose the deployed devices are device 1, device 2, device 3 and device 4. Taking device 1 as an example, after the differences between device 1 and devices 2, 3 and 4 are computed, sorting the three differences yields the similarity ranking of devices 2, 3 and 4 with respect to the environment of device 1.
It should be noted that different devices may obtain different rankings: ranked against the environment of device 1, the most similar device might be device 2, while ranked against the environment of device 2, the most similar device might be device 3.
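A minimal sketch of steps 108 and 109 for the four-device example above; the characteristic values are illustrative only:

    feature_values = {"device_1": 0.82, "device_2": 0.80,
                      "device_3": 0.65, "device_4": 0.40}

    def similarity_ranking(device, values):
        # Other devices sorted by first difference, most similar first.
        others = {d: abs(v - values[device]) for d, v in values.items() if d != device}
        return sorted(others, key=others.get)

    rankings = {d: similarity_ranking(d, feature_values) for d in feature_values}
    print(rankings["device_1"])   # ['device_2', 'device_3', 'device_4']
    print(rankings["device_2"])   # rankings differ per device, as noted above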
Step 110, after the similarity ranking of the deployed device is obtained, when the device enables its voice function again, selecting a target device from the other deployed devices according to the ranking and starting the target device's voice acquisition module to collect a target audio signal, the target device being the highest-ranked deployed device not in use by any user at that moment.
Specifically, once each device's similarity ranking is available, whenever a device's voice function is enabled again, the unused device whose environment is most similar must be started alongside it. Suppose the devices most similar to the environment of device 1 are, from high to low, device 2, device 3 and device 4. When device 1 enables its voice function again, the system first checks whether device 2 is in use; if not, device 2's voice acquisition module is started together with device 1's voice function so that device 2 collects the ambient sound and noise. If device 2 is currently in use, the system checks device 3 in the same way, and so on, until the target device is obtained.
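A minimal sketch of the selection logic of step 110; is_in_use stands in for whatever device-status query the deployment actually provides:

    def select_target_device(ranking, is_in_use):
        # Walk the ranking from most to least similar; take the first idle one.
        for device in ranking:
            if not is_in_use(device):
                return device
        return None   # every candidate is busy; no reference signal available

    in_use = {"device_2"}   # device_2 is currently serving another user
    target = select_target_device(["device_2", "device_3", "device_4"],
                                  lambda d: d in in_use)
    print(target)           # device_3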
Step 111, performing audio separation on the audio signal collected by the deployed device according to the audio features of the target audio signal, to obtain the user audio collected when the device enabled the voice function again.
Specifically, after the target device's voice acquisition module has collected the target audio signal, formed by the ambient sound and noise, the audio features of that signal can be used to separate the audio signal collected by the deployed device, yielding the user audio.
The specific separation method may be chosen according to actual needs and is not specifically limited here.
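Because the separation method is left open, the sketch below uses plain spectral subtraction as one possible choice: the target device's signal approximates the ambient noise spectrum, which is subtracted frame by frame from the mixed signal collected by the deployed device:

    import numpy as np

    def spectral_subtract(mixed, noise_ref, frame=512):
        # Subtract the reference noise magnitude from the mixed spectrum,
        # frame by frame, keeping the mixed signal's phase. Any trailing
        # partial frame is ignored in this sketch.
        out = np.zeros_like(mixed, dtype=float)
        for start in range(0, len(mixed) - frame + 1, frame):
            M = np.fft.rfft(mixed[start:start + frame])
            N = np.fft.rfft(noise_ref[start:start + frame])
            mag = np.maximum(np.abs(M) - np.abs(N), 0.0)
            out[start:start + frame] = np.fft.irfft(mag * np.exp(1j * np.angle(M)))
        return out

    rng = np.random.default_rng(0)
    noise = rng.normal(0, 0.1, 16000)                              # ambient mix
    speech = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)    # user audio
    user_audio = spectral_subtract(speech + noise, noise)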
Step 112, performing speech recognition on the user audio to obtain the user's voice command.
Specifically, speech recognition yields the text corresponding to the user audio, and semantic analysis of that text then determines the user's command. For example, if the recognized text is "I want to measure my blood pressure", semantic analysis determines that the user wants to start the blood pressure detection function, and the blood pressure detection module can then be started.
Of course, after the function the user wants to start has been determined, a prompt asking whether to start it can also be shown on the display interface, and the corresponding module is started directly once the user confirms. Assisting patients with health checks by voice in this way lowers the difficulty of operating the device.
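A minimal sketch of step 112 using keyword matching as a stand-in for full semantic analysis; the keyword table is illustrative only:

    INTENTS = {
        "blood pressure": "start_blood_pressure_module",
        "heart rate": "start_heart_rate_module",
    }

    def parse_command(text):
        # Return the action whose keyword appears in the recognized text.
        for keyword, action in INTENTS.items():
            if keyword in text.lower():
                return action
        return None

    print(parse_command("I want to measure my blood pressure"))
    # -> start_blood_pressure_module; the device would then ask the user
    #    to confirm before starting the corresponding detection module.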
In the present application, the influence of ambient sound and noise on the user audio is reduced before speech recognition is performed, which helps improve the accuracy of the speech recognition result and hence of the extracted voice command.
Meanwhile, compared with recognizing the user's lip movements from video data, speech recognition is faster and cheaper.
In a possible embodiment, after the lip-reading recognition result and the speech recognition result are obtained, the similarity ranking may instead be obtained in either of the following ways.
Way one: determining, from the lip-reading recognition result, the accuracy of the speech recognition result, the proportion of misrecognized characters in each partition of the whole utterance and the number of consecutive misrecognized characters; calculating, for each deployed device, a first similarity between its accuracy and that of each other deployed device, a second similarity between the proportions and a third similarity between the numbers of consecutive misrecognized characters; weighting and summing the first, second and third similarities corresponding to the deployed device to obtain similarities representing how the device and the other devices are affected by ambient sound and noise; and sorting the similarities in a specified order to obtain the similarity ranking.
Way two: computing the text similarity between the text of the lip-reading recognition result and the text of the speech recognition result to obtain the text similarity of the deployed device; calculating, for each deployed device, the second differences between its text similarity and those of the other deployed devices; and sorting the second differences in a specified order to obtain the similarity ranking.
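A minimal sketch of way two; difflib.SequenceMatcher is used here as one simple text-similarity measure, which the patent does not prescribe:

    import difflib

    def text_similarity(lip_text, asr_text):
        return difflib.SequenceMatcher(None, lip_text, asr_text).ratio()

    per_device = {                 # illustrative (lip-reading, ASR) text pairs
        "device_1": ("我想检测血压", "我想检测血压"),
        "device_2": ("我想检测血压", "我想检测学鸭"),
        "device_3": ("我想检测血压", "我响检测学鸭"),
    }
    scores = {d: text_similarity(*texts) for d, texts in per_device.items()}

    def ranking(device):
        # Smallest second difference first, i.e. most similar environment first.
        others = {d: abs(s - scores[device]) for d, s in scores.items() if d != device}
        return sorted(others, key=others.get)

    print(ranking("device_1"))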
In a possible implementation, fig. 2 is a schematic flowchart of another speech recognition method provided in an embodiment of the present application; as shown in fig. 2, step 111 may be implemented through the following steps.
step 201, aligning the target audio signal and audio frames with the same acquisition time in the audio signal acquired by the launching device to obtain a plurality of audio frame groups with the same audio acquisition time.
Step 202, for each group of audio frames, using the audio features of the audio frames corresponding to the target audio signals in the group of audio frames to perform audio separation on the audio frames corresponding to the audio signals collected by the launching device, so as to obtain the user audio frames collected when the launching device corresponding to the group of audio frames enables the voice function again.
And 203, splicing the user audio frames corresponding to the plurality of audio frame groups to obtain the user audio collected when the voice function of the launching device is enabled again.
Specifically, the target audio signal and the audio signal collected by the delivering device have the same recording start time point, so that the audio collected by the two delivering devices can be aligned in time, the audio frames with the same time stamp are divided into the same group, for the audio frames in the same group, the current environments of the two corresponding delivering devices are relatively similar, after the video frames are aligned, the audio frames in the audio signal collected by the delivering device can be subjected to audio separation by using the audio frames corresponding to the target audio signal, and when the audio frames are subjected to audio separation, the audio frame groups can be processed in parallel, so that the data processing speed is improved.
It should be noted that, the splicing method specifically used in audio splicing may be set according to actual needs, and is not specifically limited herein.
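A minimal sketch of steps 201 to 203, assuming both devices stamp their frames with the same clock so frames can be grouped by timestamp; the per-group separation reuses the spectral-subtraction placeholder from above, and groups are processed in parallel as the description notes:

    from concurrent.futures import ProcessPoolExecutor
    import numpy as np

    def align_frames(mixed_frames, ref_frames):
        # mixed_frames / ref_frames: dicts mapping timestamp -> 1-D sample array.
        shared = sorted(set(mixed_frames) & set(ref_frames))
        return [(t, mixed_frames[t], ref_frames[t]) for t in shared]

    def separate_group(group):
        t, mixed, ref = group
        M, R = np.fft.rfft(mixed), np.fft.rfft(ref)
        mag = np.maximum(np.abs(M) - np.abs(R), 0.0)   # per-frame subtraction
        return t, np.fft.irfft(mag * np.exp(1j * np.angle(M)), n=len(mixed))

    def separate_user_audio(mixed_frames, ref_frames):
        groups = align_frames(mixed_frames, ref_frames)
        # Groups are independent, so they can be separated in parallel; call
        # this under an `if __name__ == "__main__":` guard on spawn platforms.
        with ProcessPoolExecutor() as pool:
            results = sorted(pool.map(separate_group, groups), key=lambda r: r[0])
        return np.concatenate([frame for _, frame in results])   # splice in order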
It should be noted that like reference numbers and letters denote like items in the figures, so once an item is defined in one figure it need not be defined or explained again in subsequent figures. Moreover, the terms "first", "second", "third" and the like are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above embodiments are only specific embodiments of the present application, used to illustrate rather than limit its technical solutions, and the protection scope of the application is not restricted to them. Although the application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the art may still modify the technical solutions described in the foregoing embodiments, easily conceive of changes to them, or make equivalent substitutions of some of their technical features within the technical scope disclosed here; such modifications, changes or substitutions do not depart from the spirit and scope of the technical solutions and are all covered by the protection scope of the present application. The protection scope of the application is therefore subject to the protection scope of the claims.

Claims (3)

1. A method of speech recognition, the method comprising:
within a preset time period, when the voice function of a deployed device is enabled, starting a voice acquisition module and a video data acquisition module simultaneously to collect audio data and video data carrying the same time attribute;
labeling, by a four-point labeling method, the two mouth corners and the midpoints of the upper and lower lips captured in the video data;
each time the mouth shape changes in the video data, measuring the mouth-corner opening angles, the lip spacing between the upper and lower lips and the distance between the two mouth corners to obtain multiple groups of measurement data, wherein the mouth-corner opening angles comprise a first included angle, between the line from the upper lip to a mouth corner and the line joining the two mouth corners, and a second included angle, between the line from the lower lip to that mouth corner and the line joining the two mouth corners;
feeding the mouth-corner opening angles, the lip spacing and the mouth-corner distance of each group of measurement data into an evaluation function to obtain lip change data, the evaluation function being Pre = k × (p × Angle(a, b) + q × Line(LAF, LDE)), where k, p and q are the weight coefficients of the cost functions, Angle(a, b) is the cost function of the first and second included angles, and Line(LAF, LDE) is the cost function of the mouth-corner distance and the lip spacing;
comparing the lip change data with a lip database to obtain a lip-reading recognition result, and performing audio recognition on the audio data to obtain a speech recognition result;
determining, from the lip-reading recognition result, the accuracy of the speech recognition result, the proportion of misrecognized characters falling in each partition of the whole utterance and the number of consecutive misrecognized characters in the speech recognition result;
weighting and summing the accuracy, the proportions and the number to obtain a speech recognition characteristic value representing the speech recognition characteristics of the deployed device;
after the speech recognition characteristic values of the deployed devices are obtained, calculating, for each deployed device, the first differences between its characteristic value and those of the other deployed devices;
sorting the first differences in a specified order to obtain a similarity ranking representing how similarly the deployed device and the other deployed devices are affected by ambient sound and noise;
after the similarity ranking of the deployed device is obtained, when the device enables its voice function again, selecting a target device from the other deployed devices according to the ranking and starting the target device's voice acquisition module to collect a target audio signal, the target device being the highest-ranked deployed device not in use by any user at that moment;
performing audio separation on the audio signal collected by the deployed device according to the audio features of the target audio signal, to obtain the user audio collected when the device enabled the voice function again;
and performing speech recognition on the user audio to obtain the user's voice command.
2. The method of claim 1, wherein, after the lip-reading recognition result and the speech recognition result are obtained, the similarity ranking is alternatively obtained by:
determining, from the lip-reading recognition result, the accuracy of the speech recognition result, the proportion of misrecognized characters in each partition of the whole utterance and the number of consecutive misrecognized characters; calculating, for each deployed device, a first similarity between its accuracy and that of each other deployed device, a second similarity between the proportions and a third similarity between the numbers of consecutive misrecognized characters; weighting and summing the first, second and third similarities corresponding to the deployed device to obtain similarities representing how the device and the other devices are affected by ambient sound and noise; and sorting the similarities in a specified order to obtain the similarity ranking; or,
computing the text similarity between the text of the lip-reading recognition result and the text of the speech recognition result to obtain the text similarity of the deployed device; calculating, for each deployed device, the second differences between its text similarity and those of the other deployed devices; and sorting the second differences in a specified order to obtain the similarity ranking.
3. The method of claim 1, wherein performing audio separation on the audio signal collected by the deployed device according to the audio features of the target audio signal, to obtain the user audio collected when the device enabled the voice function again, comprises:
aligning the target audio signal with the frames of the audio signal collected by the deployed device that share the same acquisition time, to obtain a plurality of audio frame groups with the same acquisition time;
for each audio frame group, using the audio features of the frame belonging to the target audio signal to perform audio separation on the frame belonging to the signal collected by the deployed device, to obtain the user audio frame for that group;
and splicing the user audio frames of the plurality of groups to obtain the user audio collected when the deployed device enabled the voice function again.
Application CN202210328042.2A, filed 2022-03-31 (priority date 2022-03-31): Speech recognition method. Active. Granted as CN114420124B.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202210328042.2A | 2022-03-31 | 2022-03-31 | Speech recognition method (granted as CN114420124B)


Publications (2)

Publication Number | Publication Date
CN114420124A (en) | 2022-04-29
CN114420124B (en) | 2022-06-24

Family

ID=81264335

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202210328042.2A | Speech recognition method | 2022-03-31 | 2022-03-31 (Active, granted as CN114420124B)

Country Status (1)

Country: CN (CN114420124B)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111028842A (en) * 2019-12-10 2020-04-17 上海芯翌智能科技有限公司 Method and equipment for triggering voice interaction response
CN111863014A (en) * 2019-04-26 2020-10-30 北京嘀嘀无限科技发展有限公司 Audio processing method and device, electronic equipment and readable storage medium
CN112037788A (en) * 2020-09-10 2020-12-04 中航华东光电(上海)有限公司 Voice correction fusion technology
CN113782048A (en) * 2021-09-24 2021-12-10 科大讯飞股份有限公司 Multi-modal voice separation method, training method and related device
WO2021262737A1 (en) * 2020-06-24 2021-12-30 Netflix, Inc. Systems and methods for correlating speech and lip movement


Also Published As

Publication number Publication date
CN114420124A (en) 2022-04-29

Similar Documents

Publication Publication Date Title
KR101157073B1 (en) Method for finger language recognition using emg and gyro sensor and apparatus thereof
US20140156215A1 (en) Gait analysis system and method
CN108388895B (en) Machine learning-based automatic processing method for test paper answer sheet
CN107622797A (en) A kind of health based on sound determines system and method
CN108171437B (en) Psychological occupational ability evaluation system
CN109783693B (en) Method and system for determining video semantics and knowledge points
CN111599438A (en) Real-time diet health monitoring method for diabetic patient based on multi-modal data
Pham et al. Multimodal detection of Parkinson disease based on vocal and improved spiral test
EP3147831B1 (en) Information processing device and information processing method
CN103249353B (en) Method for determining physical constitutions using integrated information
CN114420124B (en) Speech recognition method
CN117789971A (en) Mental health intelligent evaluation system and method based on text emotion analysis
CN107169264B (en) complex disease diagnosis system
CN107871113B (en) Emotion hybrid recognition detection method and device
CN111090989A (en) Prompting method based on character recognition and electronic equipment
CN110020686A (en) A kind of road surface method for detecting abnormality based on intelligent perception sensing data
Elworthy Language identification with confidence limits
CN110119464B (en) Intelligent recommendation method and device for numerical values in contract
CN112927681B (en) Artificial intelligence psychological robot and method for recognizing speech according to person
US20210170228A1 (en) Information processing apparatus, information processing method and program
CN114999598B (en) Method and system for acquiring clinical experiment data, electronic equipment and storage medium
Maji et al. A Novel Technique for Detecting Depressive Disorder: A Speech Database-Based Approach
CN111435481A (en) Method and device for sorting new high-examination subject selection combinations
CN105786362B (en) A kind of based reminding method and device based on pose information
CN116942160A (en) Cognitive function detection method, system and robot

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
TR01: Transfer of patent right (effective date of registration: 2024-10-22)
  Address after: Building 1, No. 168 Ludang Road, Wujiang Economic and Technological Development Zone, Suzhou City, Jiangsu Province 215200
  Patentee after: Suzhou Miaoyijia Health Technology Group Co., Ltd., China
  Address before: 100027 05-169, 5th floor, building 1, yard 40, Xiaoyun Road, Chaoyang District, Beijing
  Patentee before: Beijing Miaoyijia Health Technology Group Co., Ltd., China