Disclosure of Invention
In view of this, the present disclosure provides a speech recognition method to improve the accuracy of speech recognition results.
The present application provides a speech recognition method, which comprises the following steps:
in a preset time period, when the voice function of a deployed device is enabled, starting a voice acquisition module and a video data acquisition module simultaneously to acquire audio data and video data having the same time attribute;
marking, using a four-point marking method, the two mouth corners and the midpoints of the upper and lower lips captured in the video data;
measuring, for each change of mouth shape in the video data, the mouth-corner opening angle, the distance between the upper and lower lips, and the distance between the two mouth corners, to obtain multiple groups of measurement data, wherein the mouth-corner opening angle comprises a first included angle and a second included angle, the first included angle being the angle between the line from the upper lip to a mouth corner and the line between the two mouth corners, and the second included angle being the angle between the line from the lower lip to a mouth corner and the line between the two mouth corners;
inputting the mouth-corner opening angle and the two distances in each group of measurement data as input parameters into an evaluation function to obtain lip change data, wherein the evaluation function is: Pre = k × (p × Angle(a, b) + q × Line(LAF, LDE)), where k, p and q are the weight coefficients of the respective cost functions, Angle(a, b) is the cost function of the first and second included angles, and Line(LAF, LDE) is the cost function of the two distances;
comparing the lip change data with a lip database to obtain a lip language recognition result, and performing audio recognition on the audio data to obtain a speech recognition result;
determining, according to the lip language recognition result, the accuracy of the speech recognition result, the proportion of incorrectly recognized words in each segment of the whole speech, and the number of consecutive incorrectly recognized words in the speech recognition result;
performing a weighted summation of the accuracy, the proportion and the number to obtain a speech recognition characteristic value representing the speech recognition characteristics of the deployed device;
after the speech recognition characteristic values of the deployed devices are obtained, calculating, for each deployed device, a first difference between the speech recognition characteristic value of each other deployed device and that of the deployed device;
sorting the first differences in a specified order to obtain a similarity ranking that characterizes how similarly the deployed device and the other deployed devices are affected by ambient sound and noise;
after the similarity ranking corresponding to the deployed device is obtained, when the deployed device enables the voice function again, selecting a target deployed device from the other deployed devices according to the similarity ranking, and starting the voice acquisition module of the target deployed device to acquire a target audio signal, wherein the target deployed device is the highest-ranked deployed device not currently being used by a user when the deployed device enables the voice function again;
performing, according to the audio features of the target audio signal, audio separation on the audio signal acquired by the deployed device to obtain the user audio acquired when the deployed device enables the voice function again;
and performing speech recognition on the user audio to obtain the user's voice instruction.
Optionally, after the lip language recognition result and the speech recognition result are obtained, the similarity ranking may also be obtained by:
determining, according to the lip language recognition result, the accuracy of the speech recognition result, the proportion of incorrectly recognized words in each segment of the whole speech, and the number of consecutive incorrectly recognized words in the speech recognition result; calculating, for each deployed device, a first similarity between the accuracy of the deployed device and that of each other deployed device, a second similarity between the proportions, and a third similarity between the numbers of consecutive incorrect words; performing a weighted summation of the first, second and third similarities corresponding to the deployed device to obtain similarities characterizing how the deployed device and the other deployed devices are affected by ambient sound and noise; and sorting the similarities in a specified order to obtain a similarity ranking that characterizes how similarly the deployed device and the other deployed devices are affected by ambient sound and noise; or,
calculating the text similarity between the text corresponding to the lip language recognition result and the text corresponding to the speech recognition result to obtain the text similarity corresponding to the deployed device; calculating, for each deployed device, a second difference between the text similarity of the deployed device and that of each other deployed device; and sorting the second differences in a specified order to obtain a similarity ranking that characterizes how similarly the deployed device and the other deployed devices are affected by ambient sound and noise.
Optionally, performing, according to the audio features of the target audio signal, audio separation on the audio signal acquired by the deployed device to obtain the user audio acquired when the deployed device enables the voice function again comprises:
aligning audio frames of the target audio signal and of the audio signal acquired by the deployed device that have the same acquisition time, to obtain a plurality of audio frame groups with the same audio acquisition time;
for each audio frame group, performing audio separation on the audio frame of the audio signal acquired by the deployed device using the audio features of the corresponding audio frame of the target audio signal, to obtain the user audio frame, corresponding to that group, acquired when the deployed device enables the voice function again;
and splicing the user audio frames corresponding to the plurality of audio frame groups to obtain the user audio acquired when the deployed device enables the voice function again.
Since the ambient sound and noise may change over time — for example, they are higher during morning and evening rush hours than at other times — audio data needs to be collected periodically in this application so as to determine, for the corresponding time period, which deployed devices have similar ambient sound and noise. To do so, a lip language recognition result and a speech recognition result are determined for each device in the current time period. Because the lip language recognition result is not affected by ambient sound and noise, it can be used to determine the accuracy of the speech recognition result, the proportion of incorrectly recognized words in each segment of the whole speech, and the number of consecutive incorrectly recognized words in the speech recognition result. The accuracy characterizes how strongly the ambient sound and noise affect the speech recognition result; the proportion of incorrectly recognized words in each segment characterizes how the recognition errors are distributed over the whole speech; and the number of consecutive incorrect words characterizes the occurrence of sudden bursts of ambient sound and noise. These three parameters are used to determine a first difference between the speech recognition characteristic value of each other deployed device and that of the deployed device, which characterizes how similar the environments of the two devices are. The first differences are then sorted, and the resulting order is the similarity ranking of how the deployed device and the other deployed devices are affected by ambient sound and noise. When a deployed device enables the voice function again, in order to reduce the influence of ambient sound and noise on it, the voice acquisition module of the target deployed device — the highest-ranked device not currently used by a user — is started at the same time. At this moment the deployed device acquires mixed audio, while the target deployed device acquires mixed audio consisting only of ambient sound and noise. Audio separation is then performed on the audio signal acquired by the deployed device using the audio features of the mixed audio from the target deployed device, yielding the user audio acquired by the deployed device. Since the influence of ambient sound and noise is reduced in the resulting user audio, the accuracy of the speech recognition result is improved, and so is the accuracy of the obtained voice instruction.
In order to make the aforementioned objects, features and advantages of the present application comprehensible, preferred embodiments accompanied with figures are described in detail below.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
The following embodiments can be applied to personal health management to help people live healthier lives. The present application is described in detail below.
Fig. 1 is a schematic flowchart of a speech recognition method according to an embodiment of the present application. As shown in Fig. 1, the speech recognition method includes the following steps.
Step 101: within a preset time period, when the voice function of a deployed device is enabled, simultaneously start the voice acquisition module and the video data acquisition module to acquire audio data and video data having the same time attribute.
Specifically, the deployed devices include blood pressure detection devices, heart rate detection devices, or multifunctional integrated detection devices, and a plurality of them are deployed at different locations so as to cover a certain area. Since the ambient sound and noise may change over time — for example, they are higher during morning and evening rush hours than at other times — audio data needs to be collected periodically so as to determine, for the corresponding time period, how the ambient sound and noise affect the speech recognition result. To do so, the voice acquisition module and the video data acquisition module are started simultaneously to acquire audio data and video data with the same time attribute. Because the lip language in the video data is not affected by ambient sound and noise, the degree to which the ambient sound and noise affect the speech recognition result can be determined from the video data.
Step 102: mark, using a four-point marking method, the two mouth corners and the midpoints of the upper and lower lips captured in the video data.
Specifically, different pronunciations correspond to different mouth shapes, where the differences include the mouth-corner opening angle, the distance between the upper and lower lips, and the distance between the two mouth corners. To obtain these data, the two mouth corners and the midpoints of the upper and lower lips captured in the video data need to be labeled, so that the lip language recognition result can be obtained from the data derived from the lines connecting the labeled points.
It should be noted that other numbers of label points may be used. For example, in a six-point labeling method, the six points are points A and F at the two mouth corners, points B and G at the midpoints of the left and right halves of the upper lip, and points C and H at the midpoints of the left and right halves of the lower lip. The specific labeling method can be chosen according to the precision actually required and is not described in detail herein.
Step 103: measure, for each change of mouth shape in the video data, the mouth-corner opening angle, the distance between the upper and lower lips, and the distance between the two mouth corners, to obtain multiple groups of measurement data, wherein the mouth-corner opening angle comprises a first included angle and a second included angle, the first included angle being the angle between the line from the upper lip to a mouth corner and the line between the two mouth corners, and the second included angle being the angle between the line from the lower lip to a mouth corner and the line between the two mouth corners.
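As an illustration only, the measurement of step 103 from the labeled points might be sketched as follows. All names are hypothetical, the points are assumed to be (x, y) pixel coordinates, and the angles are assumed to be measured at one mouth corner, a detail the application does not fix:

```python
import math

def measure(upper, lower, corner_a, corner_f):
    # upper/lower: midpoints of the upper and lower lips; corner_a/corner_f:
    # the two mouth corners, all (x, y) coordinates from the four-point labeling.
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def included_angle(p, corner, other_corner):
        # Angle at `corner` between the line corner->p and the line
        # between the two mouth corners.
        v1 = (p[0] - corner[0], p[1] - corner[1])
        v2 = (other_corner[0] - corner[0], other_corner[1] - corner[1])
        cos_t = ((v1[0] * v2[0] + v1[1] * v2[1])
                 / (math.hypot(*v1) * math.hypot(*v2)))
        return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

    a = included_angle(upper, corner_a, corner_f)   # first included angle
    b = included_angle(lower, corner_a, corner_f)   # second included angle
    l_af = dist(corner_a, corner_f)                 # mouth-corner distance LAF
    l_de = dist(upper, lower)                       # upper/lower-lip distance LDE
    return a, b, l_af, l_de
```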
Step 104: input the mouth-corner opening angle and the two distances in each group of measurement data as input parameters into an evaluation function to obtain lip change data, wherein the evaluation function is: Pre = k × (p × Angle(a, b) + q × Line(LAF, LDE)), where k, p and q are the weight coefficients of the respective cost functions, Angle(a, b) is the cost function of the first and second included angles, and Line(LAF, LDE) is the cost function of the two distances.
It should be noted that LAF is the distance between the two mouth corners and LDE is the distance between the upper and lower lips. Since pronunciation habits differ between regions, p, q and k can be set according to the area in which the device is deployed.
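A minimal sketch of the evaluation function follows, assuming simple additive forms for the two cost functions, which the application leaves unspecified; the default weights are likewise illustrative:

```python
def angle_cost(a, b):
    # Hypothetical cost over the first and second included angles;
    # the application does not fix the form of Angle(a, b).
    return a + b

def line_cost(l_af, l_de):
    # Hypothetical cost over the mouth-corner distance LAF and the
    # upper/lower-lip distance LDE; Line(LAF, LDE) is likewise unspecified.
    return l_af + l_de

def evaluate(a, b, l_af, l_de, k=1.0, p=0.5, q=0.5):
    # Pre = k * (p * Angle(a, b) + q * Line(LAF, LDE)); k, p, q are
    # weight coefficients, settable per deployment area.
    return k * (p * angle_cost(a, b) + q * line_cost(l_af, l_de))
```

Running such a function over every group of measurement data yields the lip change data that is compared against the lip database in the next step.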
Step 105: compare the lip change data with a lip database to obtain a lip language recognition result, and perform audio recognition on the audio data to obtain a speech recognition result.
Step 106: determine, according to the lip language recognition result, the accuracy of the speech recognition result, the proportion of incorrectly recognized words in each segment of the whole speech, and the number of consecutive incorrectly recognized words in the speech recognition result.
Specifically, the accuracy of the speech recognition result characterizes how strongly the ambient sound and noise affect the whole speech recognition result; the proportion of incorrectly recognized words in each segment of the whole speech characterizes how the recognition errors are distributed over the whole speech; and the number of consecutive incorrect words characterizes the occurrence of sudden bursts of ambient sound and noise. Together, these three parameters give a general picture of the ambient sound and noise.
Step 107: perform a weighted summation of the accuracy, the proportion and the number to obtain a speech recognition characteristic value representing the speech recognition characteristics of the deployed device.
Specifically, since the different data affect the accuracy of the speech recognition result to different degrees, different weights need to be assigned to the three parameters before the weighted summation, and the result is used as the speech recognition characteristic value of the deployed device.
It should be noted that the specific weights of the three parameters may be set according to actual needs and are not specifically limited herein.
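As a sketch, the weighted summation of step 107 might look as follows; the weights, and the reduction of the per-segment error proportions to their mean, are illustrative assumptions not fixed by the application:

```python
def recognition_feature_value(accuracy, segment_error_ratios, consecutive_errors,
                              w_acc=0.5, w_ratio=0.3, w_cons=0.2):
    # Illustrative weights; the application leaves weight assignment open.
    # Reduce the per-segment error proportions to a single number (an assumption).
    mean_ratio = sum(segment_error_ratios) / len(segment_error_ratios)
    return w_acc * accuracy + w_ratio * mean_ratio + w_cons * consecutive_errors
```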
Step 108: after the speech recognition characteristic values of the deployed devices are obtained, calculate, for each deployed device, a first difference between the speech recognition characteristic value of each other deployed device and that of the deployed device.
Specifically, the first difference characterizes the similarity between the environments of two deployed devices: the smaller the first difference, the more similar the environments.
Step 109: sort the first differences in a specified order to obtain a similarity ranking that characterizes how similarly the deployed device and the other deployed devices are affected by ambient sound and noise.
Specifically, after the first differences are sorted, the similarity between the environment of a given deployed device and those of the other deployed devices is known, so the ranking of how similarly they are affected by ambient sound and noise can be determined from the sorted order. For example, suppose the deployed devices are device 1, device 2, device 3 and device 4. Taking device 1 as an example, after the differences between device 1 and device 2, between device 1 and device 3, and between device 1 and device 4 are obtained, sorting these three differences yields the similarity ranking of devices 2, 3 and 4 with respect to the environment of device 1.
It should be noted that different deployed devices may yield different similarity rankings. For example, the device most similar to device 1's environment may be device 2, while the device most similar to device 2's environment may be device 3.
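A minimal sketch of steps 108 and 109, using absolute differences and ascending order (both assumptions; the application only specifies "a specified order"):

```python
def similarity_ranking(feature_values, device_id):
    # feature_values: mapping from device id to its speech recognition
    # characteristic value (an illustrative data structure).
    own = feature_values[device_id]
    diffs = {other: abs(value - own)
             for other, value in feature_values.items() if other != device_id}
    # A smaller first difference means a more similar environment,
    # so sort in ascending order.
    return sorted(diffs, key=diffs.get)

# Example for device 1 among devices 1-4 (values are made up):
# similarity_ranking({1: 0.82, 2: 0.80, 3: 0.65, 4: 0.40}, 1) == [2, 3, 4]
```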
Step 110: after the similarity ranking corresponding to the deployed device is obtained, when the deployed device enables the voice function again, select a target deployed device from the other deployed devices according to the similarity ranking, and start the voice acquisition module of the target deployed device to acquire a target audio signal, wherein the target deployed device is the highest-ranked deployed device not currently being used by a user when the deployed device enables the voice function again.
Specifically, after the similarity ranking corresponding to each deployed device is obtained, when a given device is used again and its voice function is enabled, the idle device whose environment is most similar to that device's must be started. For example, suppose device 1 enables the voice function again and the devices ranked from most to least similar to its environment are device 2, device 3 and device 4. It is first checked whether device 2 is in use; if not, the voice acquisition module of device 2 is started together with the voice function of device 1, so that device 2 can capture the ambient sound and noise. If device 2 is currently in use, it is checked whether device 3 is in use; if not, the voice acquisition module of device 3 is started instead, and so on, until the target deployed device is obtained. This selection logic is sketched below.
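A minimal sketch of the selection, assuming some means of tracking which devices are in use (a detail the application does not specify):

```python
def select_target_device(ranking, devices_in_use):
    # ranking: device ids ordered from most to least similar environment;
    # devices_in_use: set of ids currently being used by users.
    for device_id in ranking:
        if device_id not in devices_in_use:
            return device_id  # highest-ranked idle device
    return None  # no idle device is available

# With ranking [2, 3, 4] and device 2 in use, device 3 is selected.
```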
Step 111: perform, according to the audio features of the target audio signal, audio separation on the audio signal acquired by the deployed device to obtain the user audio acquired when the deployed device enables the voice function again.
Specifically, after the voice acquisition module of the target deployed device has acquired a target audio signal consisting of the ambient sound and noise, the audio features of the target audio signal can be used to perform audio separation on the audio signal acquired by the deployed device, yielding the user audio captured by the deployed device.
The specific separation method may be set according to actual needs and is not specifically limited herein.
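Since the application leaves the separation method open, the following is just one possible sketch, using frame-wise spectral subtraction of the target (ambient) signal from the mixed signal; the signal layout and sample-rate match are assumptions:

```python
import numpy as np

def spectral_subtract(mixed, ambient, frame_len=512):
    # mixed: audio captured by the deployed device (user voice + ambience);
    # ambient: target audio signal from the target deployed device.
    # Both are assumed to be 1-D float arrays at the same sample rate.
    n = min(len(mixed), len(ambient)) // frame_len * frame_len
    out = np.empty(n)
    for start in range(0, n, frame_len):
        m = np.fft.rfft(mixed[start:start + frame_len])
        a = np.fft.rfft(ambient[start:start + frame_len])
        # Subtract the ambient magnitude spectrum, keeping the mixed phase.
        mag = np.maximum(np.abs(m) - np.abs(a), 0.0)
        out[start:start + frame_len] = np.fft.irfft(
            mag * np.exp(1j * np.angle(m)), frame_len)
    return out
```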
Step 112: perform speech recognition on the user audio to obtain the user's voice instruction.
Specifically, after speech recognition, the text corresponding to the user audio is obtained, and the user's instruction is then determined through semantic analysis. For example, when the recognized text is "I want to measure my blood pressure", semantic analysis determines that the user wants to start the blood pressure detection function, and the blood pressure detection module can then be started.
Of course, after the function the user wants to start has been determined, a prompt asking whether to start the corresponding function may also be shown on the display interface, and the corresponding module is started once the user confirms. Assisting patients with health detection by voice in this way reduces the difficulty patients face in operating the device.
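A minimal sketch of the kind of keyword-based intent mapping described above; the application does not prescribe a semantic-analysis method, and all names here are illustrative:

```python
# Hypothetical mapping from recognized keywords to detection modules.
INTENT_KEYWORDS = {
    "blood pressure": "blood_pressure_module",
    "heart rate": "heart_rate_module",
}

def resolve_instruction(text):
    for keyword, module in INTENT_KEYWORDS.items():
        if keyword in text.lower():
            return module  # module to prompt for and start on confirmation
    return None

# resolve_instruction("I want to measure my blood pressure")
# -> "blood_pressure_module"
```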
In this application, the influence of ambient sound and noise on the user audio is reduced before speech recognition is performed, which helps improve the accuracy of the speech recognition result and thus the accuracy of the obtained voice instruction.
Meanwhile, compared with recognizing the user's lip language from video data, speech recognition has the advantages of higher recognition speed and lower recognition cost.
In a possible embodiment, after the lip language recognition result and the speech recognition result are obtained, the similarity ranking may also be obtained in either of the following two ways.
the first method is as follows: determining the accuracy of the voice recognition result, the ratio of the wrongly recognized characters in the voice recognition result in each division area of the whole voice and the quantity of continuous wrongly recognized characters in the voice recognition result according to the lip language recognition result; for each throwing device, calculating a first similarity of the accuracy rate of the throwing device and other throwing devices, a second similarity of the occupation ratio and a third similarity of the number of continuous wrong characters; weighting and summing the first similarity, the second similarity and the third similarity corresponding to the throwing device to obtain similarities which are used for representing that the throwing device and other throwing devices are influenced by environment and noise; and sequencing the similarity according to a specified sequence to obtain a similarity sequence for representing the similar situation of the delivering device and other delivering devices influenced by the surrounding environment and noise.
The second method comprises the following steps: performing text similarity calculation on the text information corresponding to the lip language recognition result and the text information corresponding to the voice recognition result to obtain text similarity corresponding to the releasing equipment; for each delivery device, calculating a second difference value between the text similarity of the delivery device and the text similarity of other delivery devices; and sorting the second difference values according to a specified sequence to obtain a similarity sorting which is used for representing the similar situation of the projection device and other projection devices influenced by the surrounding environment and noise.
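A sketch of method two; SequenceMatcher is only one possible text-similarity measure, and the absolute differences and ascending order are assumptions:

```python
from difflib import SequenceMatcher

def text_similarity(lip_text, speech_text):
    # One possible similarity measure; the application does not fix one.
    return SequenceMatcher(None, lip_text, speech_text).ratio()

def second_difference_ranking(similarities, device_id):
    # similarities: mapping from device id to its text similarity score.
    own = similarities[device_id]
    diffs = {other: abs(score - own)
             for other, score in similarities.items() if other != device_id}
    return sorted(diffs, key=diffs.get)
```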
In a possible implementation, Fig. 2 is a schematic flowchart of another speech recognition method provided in an embodiment of the present application. As shown in Fig. 2, step 111 may be implemented through the following steps.
Step 201: align audio frames of the target audio signal and of the audio signal acquired by the deployed device that have the same acquisition time, to obtain a plurality of audio frame groups with the same audio acquisition time.
Step 202: for each audio frame group, perform audio separation on the audio frame of the audio signal acquired by the deployed device using the audio features of the corresponding audio frame of the target audio signal, to obtain the user audio frame, corresponding to that group, acquired when the deployed device enables the voice function again.
Step 203: splice the user audio frames corresponding to the plurality of audio frame groups to obtain the user audio acquired when the deployed device enables the voice function again.
Specifically, the target audio signal and the audio signal acquired by the deployed device have the same recording start time, so the audio captured by the two devices can be aligned in time, and audio frames with the same timestamp are placed in the same group. For frames in the same group, the environments of the two corresponding devices at that moment are relatively similar. After alignment, the audio frames of the signal acquired by the deployed device can be separated using the corresponding frames of the target audio signal, and the audio frame groups can be processed in parallel during separation to increase the data processing speed.
It should be noted that the splicing method used for the audio may be set according to actual needs and is not specifically limited herein.
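Under the same assumptions as the spectral-subtraction sketch above, the frame grouping and splicing of steps 201 to 203 might look as follows; the per-group calls are independent and could be parallelized as noted, and any other separation or splicing method could be substituted:

```python
import numpy as np

def separate_user_audio(device_audio, target_audio, frame_len=512):
    # Both signals share a recording start time, so frames at the same
    # offset form one audio frame group (step 201).
    n = min(len(device_audio), len(target_audio)) // frame_len * frame_len
    user_frames = [
        # Step 202: separate each group, reusing the spectral_subtract
        # sketch above (an assumed choice of separation method).
        spectral_subtract(device_audio[s:s + frame_len],
                          target_audio[s:s + frame_len], frame_len)
        for s in range(0, n, frame_len)
    ]
    # Step 203: splice the user audio frames back together.
    return np.concatenate(user_frames) if user_frames else np.array([])
```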
It should be noted that like reference numbers and letters refer to like items in the following figures; thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Moreover, the terms "first", "second", "third", etc. are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above embodiments are only specific embodiments of the present application, used to illustrate rather than limit its technical solutions, and the protection scope of the present application is not limited to them. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art may still modify the technical solutions described in the foregoing embodiments, easily conceive of changes to them, or make equivalent substitutions for some of their technical features within the technical scope disclosed in the present application. Such modifications, changes or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.