
CN115331673B - Voiceprint recognition household appliance control method and device in complex sound scene - Google Patents

Voiceprint recognition household appliance control method and device in complex sound scene

Info

Publication number
CN115331673B
CN115331673B (application number CN202211256541.1A)
Authority
CN
China
Prior art keywords
audio
voiceprint recognition
similarity
model
template
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211256541.1A
Other languages
Chinese (zh)
Other versions
CN115331673A (en)
Inventor
张林焘
吴昊
别荣芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Normal University
Original Assignee
Beijing Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Normal University filed Critical Beijing Normal University
Priority to CN202211256541.1A priority Critical patent/CN115331673B/en
Publication of CN115331673A publication Critical patent/CN115331673A/en
Application granted granted Critical
Publication of CN115331673B publication Critical patent/CN115331673B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/18 Artificial neural networks; Connectionist approaches
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Selective Calling Equipment (AREA)

Abstract

The invention provides a voiceprint recognition household appliance control method and device in a complex sound scene, and relates to the field of household appliance control. The template audio fully considers the various conditions of a complex sound scene, is highly representative, and lays a foundation for improving voiceprint recognition accuracy in a complex sound scene. A similarity detection model based on the template audio, a voiceprint recognition decision model based on an SVM model, and a voiceprint recognition model based on a convolutional neural network judge the audio in sequence, so that the voiceprint recognition accuracy is improved. The models progress from simple to complex: audio that is easy to judge obtains a result from the simple models, audio signals that are difficult to judge obtain a result from the complex model, and the consumption of computing resources is reduced.

Description

Voiceprint recognition household appliance control method and device in complex sound scene
Technical Field
The invention relates to the field of household appliance control, in particular to a voiceprint recognition household appliance control method and device in a complex sound scene.
Background
With the progress of science and technology, modern household appliances are increasingly widely used by consumers. As an important identity recognition technology, voiceprint recognition can identify family members, so that a household appliance accepts instructions only from specific family members and is not disturbed by the instructions of unrelated persons. Under normal conditions, common voiceprint recognition technology can guarantee high recognition accuracy, enabling specific family members to control household appliances accurately.
However, when a household appliance is controlled through voiceprint recognition, a complex sound scene is often present, and the recognition accuracy of the voiceprint recognition technology drops sharply. As the recognition accuracy falls, the application value of household appliances controlled by voiceprint recognition also falls significantly. Therefore, designing a voiceprint recognition household appliance control method that can guarantee recognition accuracy in a complex sound scene has very important application value.
Disclosure of Invention
In order to overcome the above problems or at least partially solve the above problems, embodiments of the present invention provide a method and an apparatus for controlling a voiceprint recognition appliance in a complex sound scene.
The embodiment of the invention is realized by the following steps:
in a first aspect, an embodiment of the present invention provides a voiceprint recognition household appliance control method in a complex sound scene, including:
respectively recording multiple sections of audio of specific family members in multiple sound scenes;
encoding a plurality of pieces of audio;
after encoding, calculating the similarity between every two audios of each family member, reserving a section of audio with the similarity larger than a preset value, and regarding all reserved audio as template audio;
all template audios are used as positive training samples, audios of a plurality of non-specific family members are collected and used as negative training samples, and a machine learning model is used for training to obtain a voiceprint recognition decision model;
when the household appliance user outputs a section of audio, calculating the similarity between the section of audio and the template audio, and if the similarity between the section of audio and any template audio is greater than the preset similarity, directly identifying the audio as the audio of a specific family member; if the similarity between the section of audio and any template audio is smaller than the preset similarity, carrying out the next step;
and judging whether the output audio of the household appliance user is the audio of the specific family member by using the voiceprint recognition decision model.
Based on the first aspect, in some embodiments of the invention, the machine learning model is an SVM model.
Based on the first aspect, in some embodiments of the present invention, the step of determining whether the output audio of the household appliance user is the audio of a specific family member by using a voiceprint recognition decision model includes:
if the score of the voiceprint recognition decision result based on the SVM model is larger than a first preset score, the audio is directly recognized as the audio of a specific family member; if the score is smaller than a second preset score, the audio is directly recognized as the audio of a non-specific family member; and if the score is between the first preset score and the second preset score, the next step is performed;
and finally judging the output audio of the household appliance user by using a voiceprint recognition model based on a convolutional neural network to determine whether the audio is the audio of a specific family member.
Based on the first aspect, in some embodiments of the present invention, the step of calculating the similarity between the audio segment and the template audio comprises:
performing, for the segment of audio and the template audio: audio filtering, calculating short-time energy of an audio signal and intercepting effective data of the audio signal;
and calculating the cosine distance between the section of audio and the template audio.
Based on the first aspect, in some embodiments of the present invention, the step of respectively recording multiple pieces of audio of a specific family member in multiple sound scenes includes:
recording multiple segments of audio of a specific family member under one or more conditions of high noise, multiple people speaking, and low volume;
and controlling the duration of each piece of audio to be within 5 seconds when the audio is recorded.
Based on the first aspect, in some embodiments of the present invention, the step of encoding the multiple pieces of audio includes:
and coding the multi-segment audio by using an I-Vector calculation method.
Based on the first aspect, in some embodiments of the invention, the step of collecting audio of a plurality of non-specific family members as negative training samples comprises:
collecting more than 50 audio recordings of non-specific family members as negative training samples.
In a second aspect, an embodiment of the present invention provides a home appliance control system for voiceprint recognition in a complex sound scene, including:
a recording module: respectively recording multiple sections of audio of specific family members in multiple sound scenes;
the coding module: encoding a plurality of pieces of audio;
a calculate similarity module: after encoding, calculating the similarity between every two audios of each family member, reserving a section of audio with the similarity larger than a preset value, and regarding all reserved audio as template audio;
a training module: all template audios are used as positive training samples, audios of a plurality of non-specific family members are collected and used as negative training samples, and a machine learning model is used for training to obtain a voiceprint recognition decision model;
an identification module: when the household appliance user outputs a section of audio, calculating the similarity between the section of audio and the template audio, and if the similarity between the section of audio and any template audio is greater than the preset similarity, directly identifying the audio as the audio of a specific family member;
a judging module: if the similarity between the audio and any template audio is smaller than the preset similarity, the voiceprint recognition decision model is used for judging whether the output audio of the household appliance user is the audio of the specific family member.
In a third aspect, an embodiment of the present invention provides an electronic device, including:
at least one processor, at least one memory, and a data bus; wherein:
the processor and the memory complete mutual communication through the data bus; the memory stores program instructions executable by the processor, and the processor calls the program instructions to execute the method.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium storing a computer program, where the computer program causes a computer to execute the method described above.
Compared with the prior art, the embodiment of the invention has at least the following advantages or beneficial effects:
(1) The template audio fully considers various conditions in a complex sound scene, has better representativeness, and lays a foundation for improving the voiceprint recognition precision in the complex sound scene.
(2) The similarity detection model based on the template audio, the voiceprint recognition decision model based on the SVM model, and the voiceprint recognition model based on the convolutional neural network judge the audio in sequence, so that the voiceprint recognition accuracy is improved.
(3) Because the models progress from simple to complex, audio that is easy to judge obtains a result from the simple models, audio that is difficult to judge obtains a result from the complex model, and the consumption of computing resources is reduced.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a flowchart of an embodiment of a voiceprint recognition home appliance control method in a complex sound scene according to the present invention;
FIG. 2 is a flowchart of an embodiment of a voiceprint recognition appliance control method in a complex sound scene according to the present invention;
fig. 3 is a block diagram illustrating a structure of a voiceprint recognition home appliance control apparatus in a complex sound scene according to an embodiment of the present invention;
fig. 4 is a block diagram of an electronic device according to an embodiment of the invention.
Reference numerals: 1. a recording module; 2. an encoding module; 3. a calculate similarity module; 4. a training module; 5. an identification module; 6. a judgment module; 7. a processor; 8. a memory; 9. a data bus.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. The system embodiments are merely illustrative, and for example, the block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and computer program products according to various embodiments of the present application. In this regard, each block in the block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device, which may be a personal computer, a server, or a network device, to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In the description of the embodiments of the present invention, "a plurality" represents at least 2.
In the description of the embodiments of the present invention, it should be further noted that unless otherwise explicitly stated or limited, the terms "disposed" and "connected" should be interpreted broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
Examples
Referring to fig. 1, in a first aspect, an embodiment of the present invention provides a method for controlling a home appliance for voiceprint recognition in a complex sound scene, including:
s1: respectively recording multiple sections of audio of specific family members in multiple sound scenes;
In this step, the plurality of sound scenes, namely complex scenes, cover various conditions such as high noise, multiple people speaking, and low volume, so that the conditions under which voiceprint recognition is used with the household appliance are included as comprehensively as possible. The audio duration can be set according to actual conditions; since voice control of a household appliance is usually brief, each segment of audio can be kept within 5 seconds. The template audio thus fully considers the various conditions of a complex sound scene, is highly representative, and lays a foundation for improving voiceprint recognition accuracy in a complex sound scene.
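Purely as an illustration of this enrollment step, the sketch below records one short clip per scene; the scene names, file layout, sample rate, and the use of the sounddevice and soundfile libraries are assumptions of this sketch and are not specified by the patent.

```python
# A minimal sketch of step S1 under stated assumptions: scene names, file
# naming, sample rate and the sounddevice/soundfile libraries are illustrative
# choices, not requirements of the patent.
import sounddevice as sd
import soundfile as sf

SCENES = ["high_noise", "multiple_speakers", "low_volume"]  # complex-scene conditions

def record_enrollment(member: str, seconds: int = 5, sr: int = 16000) -> None:
    """Record one clip (within 5 seconds) per sound scene for one family member."""
    for scene in SCENES:
        input(f"Press Enter to record '{member}' in scene '{scene}' ...")
        audio = sd.rec(int(seconds * sr), samplerate=sr, channels=1)
        sd.wait()  # block until the recording finishes
        sf.write(f"{member}_{scene}.wav", audio, sr)
```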
S2: encoding a plurality of pieces of audio;
In this step, the multiple segments of audio are encoded using an I-Vector calculation method. In practical applications, speaker information is mixed with various kinds of interference in the speech signal, and because different acquisition devices have different channels, channel interference is also mixed into the collected speech. Such interference perturbs the speaker information. The traditional GMM-UBM method cannot overcome this problem, so system performance is unstable. In the GMM-UBM framework, each target speaker is described by a GMM model. Because only the means are adapted from the UBM model to each speaker's GMM model, while the weights and covariances are left unchanged, most of the speaker information is contained in the GMM means. Besides most of the speaker information, the GMM mean vector also contains channel information. Joint Factor Analysis (JFA) can model speaker differences and channel differences separately, thereby compensating for channel differences and improving system performance. However, JFA requires a large amount of training corpora from different channels, which are difficult to obtain, and its calculation is complicated, so it is difficult to put into practical use. Dehak proposed a novel solution based on the I-Vector factor analysis technique: whereas JFA models the speaker difference space and the channel difference space separately, the I-Vector method models the total variability as a whole, which relaxes the requirements on the corpus, keeps the calculation simple, and achieves comparable performance. A minimal sketch of this encoding is given below.
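As a rough illustration of this encoding step, the sketch computes an i-vector as the posterior mean w = (I + Tᵀ Σ⁻¹ N T)⁻¹ Tᵀ Σ⁻¹ F from MFCC features. It assumes a diagonal-covariance UBM and a total-variability matrix T that were trained elsewhere (both hypothetical here), so it is a sketch of the general I-Vector technique rather than the patent's exact implementation.

```python
# A hedged sketch of i-vector encoding, assuming a pre-trained UBM with
# covariance_type='diag' (sklearn GaussianMixture) and a total-variability
# matrix T of shape (C*D, ivec_dim) trained elsewhere.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(path: str, sr: int = 16000, n_mfcc: int = 20) -> np.ndarray:
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T   # (frames, D)

def baum_welch_stats(feats: np.ndarray, ubm: GaussianMixture):
    post = ubm.predict_proba(feats)                 # frame responsibilities, (frames, C)
    N = post.sum(axis=0)                            # zeroth-order statistics, (C,)
    F = post.T @ feats - N[:, None] * ubm.means_    # centred first-order statistics, (C, D)
    return N, F

def ivector(feats: np.ndarray, ubm: GaussianMixture, T: np.ndarray) -> np.ndarray:
    """Posterior mean w = (I + T' S^-1 N T)^-1 T' S^-1 F with diagonal covariances."""
    N, F = baum_welch_stats(feats, ubm)
    C, D = ubm.means_.shape
    sigma_inv = 1.0 / ubm.covariances_.reshape(C * D)       # diagonal covariances flattened
    TtS = T.T * sigma_inv                                   # (ivec_dim, C*D)
    L = np.eye(T.shape[1]) + (TtS * np.repeat(N, D)) @ T    # posterior precision matrix
    return np.linalg.solve(L, TtS @ F.reshape(C * D))       # the i-vector
```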
S3: after encoding, calculating the similarity between every two audios of each family member, reserving a section of audio with the similarity larger than a preset value, and regarding all reserved audio as template audio;
In this step, calculating the similarity between every two audio segments of each family member comprises filtering both segments, calculating the short-time energy of each audio signal, intercepting the effective data of each audio signal, and then calculating the cosine distance between the two segments. The audio segments whose similarity is larger than the preset value are retained, and all retained audio is regarded as template audio; the preset value can be set reasonably according to actual requirements.
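For example, if the encoded clips are fixed-length vectors, the pairwise comparison and template selection of this step might reduce to the following sketch; the preset value of 0.75 is an illustrative placeholder rather than a value from the patent.

```python
# A sketch of the S3 template selection over encoded (e.g. i-vector) clips;
# the preset similarity of 0.75 is an assumed placeholder value.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def select_templates(encoded_clips: list, preset: float = 0.75) -> list:
    """Keep every clip whose similarity to at least one other clip exceeds the preset."""
    templates = []
    for i, vi in enumerate(encoded_clips):
        if any(cosine_similarity(vi, vj) > preset
               for j, vj in enumerate(encoded_clips) if j != i):
            templates.append(vi)
    return templates
```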
S4: taking all template audios as positive training samples, collecting audios of a plurality of non-specific family members as negative training samples, and training by using a machine learning model to obtain a voiceprint recognition decision model;
In this step, collecting the audio of a plurality of non-specific family members as negative training samples comprises collecting more than 50 audio recordings of non-specific family members as negative training samples. The machine learning model may be an SVM model.
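Training the decision model could then look like the scikit-learn sketch below; the RBF kernel and the probability output are assumptions of this sketch, chosen so that the later cascade can threshold a score, and are not hyperparameters disclosed in the patent.

```python
# A minimal sketch of S4: train an SVM on template (positive) versus
# non-specific-member (negative) encodings; kernel and settings are assumptions.
import numpy as np
from sklearn.svm import SVC

def train_decision_model(template_vecs, negative_vecs) -> SVC:
    X = np.vstack([template_vecs, negative_vecs])
    y = np.concatenate([np.ones(len(template_vecs)), np.zeros(len(negative_vecs))])
    model = SVC(kernel="rbf", probability=True)   # probability score feeds the cascade
    model.fit(X, y)
    return model
```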
S5: when the household appliance user outputs a section of audio, calculating the similarity between the section of audio and the template audio, and if the similarity between the section of audio and any template audio is greater than the preset similarity, directly identifying the audio as the audio of a specific family member; if the similarity between the section of audio and any template audio is smaller than the preset similarity, performing the next step;
In this step, the similarity between the segment of audio and the template audio may be calculated using a similarity detection model based on the template audio. Calculating this similarity comprises performing, for both the segment of audio and the template audio, audio filtering, calculation of the short-time energy of the audio signal, and interception of the effective data of the audio signal, and then calculating the cosine distance between the segment of audio and the template audio.
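One plausible realization of the filtering, short-time energy, and effective-data steps is sketched below; the 80-4000 Hz band, frame sizes, and energy threshold are assumptions rather than values given in the patent.

```python
# A sketch of the S5 preprocessing: band-pass filtering, short-time energy,
# and interception of the high-energy (effective) frames. All numeric values
# here are illustrative assumptions.
import numpy as np
from scipy.signal import butter, sosfilt

def effective_data(signal: np.ndarray, sr: int = 16000,
                   frame_len: int = 400, hop: int = 160,
                   energy_ratio: float = 0.1) -> np.ndarray:
    sos = butter(4, [80, 4000], btype="bandpass", fs=sr, output="sos")
    filtered = sosfilt(sos, signal)                              # audio filtering
    frames = [filtered[i:i + frame_len]
              for i in range(0, len(filtered) - frame_len + 1, hop)]
    energy = np.array([float(np.sum(f * f)) for f in frames])    # short-time energy
    keep = energy > energy_ratio * energy.max()                  # effective frames only
    return np.concatenate([f for f, k in zip(frames, keep) if k])
```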
S6: and judging whether the output audio of the household appliance user is the audio of the specific family member by using the voiceprint recognition decision model.
The similarity detection model based on the template audio and the voiceprint recognition decision model based on the SVM model judge the audio in sequence, so that the voiceprint recognition accuracy is improved. The models progress from simple to complex: audio that is easy to judge obtains a result from the simple model, audio signals that are difficult to judge obtain a result from the complex model, and the consumption of computing resources is reduced.
Based on the first aspect, in some embodiments of the present invention, the step of determining whether the output audio of the household appliance user is the audio of a specific family member by using the voiceprint recognition decision model includes:
Referring to fig. 2, S61: if the score of the voiceprint recognition decision result based on the SVM model is larger than a first preset score, the audio is directly recognized as the audio of a specific family member; if the score is smaller than a second preset score, the audio is directly recognized as the audio of a non-specific family member; and if the score is between the first preset score and the second preset score, the next step is performed;
S62: finally judging the output audio of the household appliance user by using a voiceprint recognition model based on a convolutional neural network, to determine whether it is the audio of a specific family member. A minimal sketch of such a network is given below.
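For the final stage, a small convolutional network such as the PyTorch sketch below could serve as the voiceprint recognition model; the spectrogram input and layer sizes are illustrative assumptions, not the architecture disclosed in the patent.

```python
# A minimal PyTorch sketch of the final-stage CNN classifier; the input is
# assumed to be a (batch, 1, n_mels, n_frames) spectrogram and the layer
# widths are illustrative, not taken from the patent.
import torch
import torch.nn as nn

class VoiceprintCNN(nn.Module):
    def __init__(self, n_classes: int = 2):          # specific member vs. not
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))
```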
The similarity detection model based on the template audio, the voiceprint recognition decision model based on the SVM model, and the voiceprint recognition model based on the convolutional neural network judge the audio in sequence, so that the voiceprint recognition accuracy is improved. Because the models progress from simple to complex, audio that is easy to judge obtains a result from the simple models, audio that is difficult to judge obtains a result from the complex model, and the consumption of computing resources is reduced. The complete decision flow is sketched below.
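Putting the three stages together, the decision flow from S5 through S62 might look like the following sketch; the preset similarity and the two preset scores are placeholder values, and cnn_predict stands in for whatever inference wrapper the trained CNN uses (a hypothetical helper, not part of the patent).

```python
# A hedged sketch of the three-stage cascade (template similarity -> SVM score
# -> CNN); the thresholds 0.8 / 0.9 / 0.1 are assumed placeholders.
import numpy as np

def _cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def is_specific_member(audio_vec, templates, svm, cnn_predict,
                       sim_preset=0.8, score_hi=0.9, score_lo=0.1) -> bool:
    # Stage 1: similarity to any template audio
    if any(_cos(audio_vec, t) > sim_preset for t in templates):
        return True
    # Stage 2: SVM decision score with two preset thresholds
    score = svm.predict_proba([audio_vec])[0, 1]
    if score > score_hi:
        return True
    if score < score_lo:
        return False
    # Stage 3: CNN voiceprint model resolves the ambiguous middle band
    return bool(cnn_predict(audio_vec))
```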
Referring to fig. 3, in a second aspect, an embodiment of the present invention provides a voiceprint recognition home appliance control system in a complex sound scene, including:
the recording module 1: respectively recording multiple sections of audio of specific family members in multiple sound scenes;
and the coding module 2: encoding a plurality of pieces of audio;
calculate similarity module 3: after encoding, calculating the similarity between every two audios of each family member, reserving a section of audio with the similarity larger than a preset value, and regarding all reserved audio as template audio;
the training module 4: all template audios are used as positive training samples, audios of a plurality of non-specific family members are collected and used as negative training samples, and a machine learning model is used for training to obtain a voiceprint recognition decision model;
the identification module 5: when the household appliance user outputs a section of audio, calculating the similarity between the section of audio and the template audio, and if the similarity between the section of audio and any template audio is greater than the preset similarity, directly identifying the audio as the audio of a specific family member;
and a judging module 6: if the similarity between the audio and any template audio is smaller than the preset similarity, the voiceprint recognition decision model is used for judging whether the output audio of the household appliance user is the audio of the specific family member.
For the specific implementation of the apparatus, please refer to the implementation of the method, and redundant description is omitted here.
Referring to fig. 4, in a third aspect, an embodiment of the invention provides an electronic device, including:
at least one processor 7, at least one memory 8 and a data bus 9; wherein:
the processor 7 and the memory 8 complete communication with each other through the data bus 9; the memory 8 stores program instructions executable by the processor 7, and the processor 7 calls the program instructions to perform the method. For example, the above steps S1-S6 are performed.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium storing a computer program, where the computer program causes a computer to execute the method described above. For example, the above steps S1-S6 are performed.
In conclusion, the invention provides a voiceprint recognition household appliance control method in a complex sound scene. The template audio fully considers the various conditions of a complex sound scene, is highly representative, and lays a foundation for improving voiceprint recognition accuracy in a complex sound scene. The similarity detection model based on the template audio, the voiceprint recognition decision model based on the SVM model, and the voiceprint recognition model based on the convolutional neural network judge the audio in sequence, so that the voiceprint recognition accuracy is improved. The models progress from simple to complex: audio that is easy to judge obtains a result from the simple models, audio signals that are difficult to judge obtain a result from the complex model, and the consumption of computing resources is reduced.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
It will be evident to those skilled in the art that the application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims (8)

1. A voiceprint recognition household appliance control method in a complex sound scene is characterized by comprising the following steps:
respectively recording multiple sections of audio of specific family members in multiple sound scenes;
encoding a plurality of pieces of audio;
after encoding, calculating the similarity between every two audios of each family member, reserving a section of audio with the similarity larger than a preset value, and regarding all reserved audio as template audio;
taking all template audios as positive training samples, collecting audios of a plurality of non-specific family members as negative training samples, and training by using a machine learning model to obtain a voiceprint recognition decision model, wherein the machine learning model is an SVM model;
when the household appliance user outputs a section of audio, calculating the similarity between the section of audio and the template audio, and if the similarity between the section of audio and any template audio is greater than the preset similarity, directly identifying the audio as the audio of a specific family member; if the similarity between the section of audio and any template audio is smaller than the preset similarity, performing the next step;
judging whether the output audio of the household appliance user is the audio of a specific family member by using a voiceprint recognition decision model;
the step of judging whether the output audio of the household appliance user is the audio of a specific family member by using the voiceprint recognition decision model comprises the following steps:
if the score of the voiceprint recognition decision result based on the SVM model is larger than a first preset score, directly recognizing the audio as the audio of a specific family member; if the score is smaller than a second preset score, directly recognizing the audio as the audio of a non-specific family member; and if the score is between the first preset score and the second preset score, performing the next step;
and finally judging the output audio of the household appliance user by using a voiceprint recognition model based on a convolutional neural network to determine whether the audio is the audio of a specific family member.
2. The method as claimed in claim 1, wherein the user of the home appliance outputs a segment of audio, and the step of calculating the similarity between the segment of audio and the template audio comprises:
performing, for the segment of audio and the template audio: audio filtering, calculating short-time energy of an audio signal and intercepting effective data of the audio signal;
and calculating the cosine distance between the section of audio and the template audio.
3. The method as claimed in claim 1, wherein the step of respectively recording multiple audio segments of a specific family member in a plurality of sound scenes comprises:
recording multiple segments of audio of a specific family member under one or more conditions of high noise, multiple people speaking, and low volume;
and controlling the duration of each piece of audio to be within 5 seconds when the audio is recorded.
4. The method as claimed in claim 1, wherein the step of encoding the multiple segments of audio comprises:
and coding the multi-segment audio by using an I-Vector calculation method.
5. The method as claimed in claim 1, wherein the step of collecting the audio of a plurality of non-specific family members as negative training samples comprises:
collecting more than 50 audio recordings of non-specific family members as negative training samples.
6. A voiceprint recognition household appliance control device in a complex sound scene is characterized by comprising:
a recording module: respectively recording multiple sections of audio of specific family members in multiple sound scenes;
and an encoding module: encoding a plurality of pieces of audio;
a calculate similarity module: after encoding, calculating the similarity between every two audios of each family member, reserving a section of audio with the similarity larger than a preset value, and regarding all reserved audio as template audio;
a training module: all template audios are used as positive training samples, audios of a plurality of non-specific family members are collected and used as negative training samples, and a machine learning model is used for training to obtain a voiceprint recognition decision model, wherein the machine learning model is an SVM model;
an identification module: when the household appliance user outputs a section of audio, calculating the similarity between the section of audio and the template audio, and if the similarity between the section of audio and any template audio is greater than the preset similarity, directly identifying the audio as the audio of a specific family member;
a judgment module: if the similarity between the section of audio and any template audio is smaller than the preset similarity, judging whether the output audio of the household appliance user is the audio of a specific family member by using a voiceprint recognition decision model;
the judging module comprises:
an identification submodule: if the score of the voiceprint recognition decision result based on the SVM model is larger than a first preset score, directly recognizing the audio as the audio of a specific family member; if the score is smaller than a second preset score, directly recognizing the audio as the audio of a non-specific family member; and if the score is between the first preset score and the second preset score, performing the next step;
a final decision submodule: finally judging the output audio of the household appliance user by using a voiceprint recognition model based on a convolutional neural network to determine whether the audio is the audio of a specific family member.
7. An electronic device, comprising:
at least one processor, at least one memory, and a data bus; wherein:
the processor and the memory complete mutual communication through the data bus; the memory stores program instructions executable by the processor, the processor calling the program instructions to perform the method of any of claims 1 to 5.
8. A non-transitory computer-readable storage medium storing a computer program that causes a computer to perform the method according to any one of claims 1 to 5.
CN202211256541.1A 2022-10-14 2022-10-14 Voiceprint recognition household appliance control method and device in complex sound scene Active CN115331673B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211256541.1A CN115331673B (en) 2022-10-14 2022-10-14 Voiceprint recognition household appliance control method and device in complex sound scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211256541.1A CN115331673B (en) 2022-10-14 2022-10-14 Voiceprint recognition household appliance control method and device in complex sound scene

Publications (2)

Publication Number Publication Date
CN115331673A CN115331673A (en) 2022-11-11
CN115331673B true CN115331673B (en) 2023-01-03

Family

ID=83913606

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211256541.1A Active CN115331673B (en) 2022-10-14 2022-10-14 Voiceprint recognition household appliance control method and device in complex sound scene

Country Status (1)

Country Link
CN (1) CN115331673B (en)

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9875743B2 (en) * 2015-01-26 2018-01-23 Verint Systems Ltd. Acoustic signature building for a speaker from multiple sessions
CN105895077A (en) * 2015-11-15 2016-08-24 乐视移动智能信息技术(北京)有限公司 Recording editing method and recording device
CN106971737A (en) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 A kind of method for recognizing sound-groove spoken based on many people
CN107766868A (en) * 2016-08-15 2018-03-06 中国联合网络通信集团有限公司 A kind of classifier training method and device
CN108305615B (en) * 2017-10-23 2020-06-16 腾讯科技(深圳)有限公司 Object identification method and device, storage medium and terminal thereof
CN110164453A (en) * 2019-05-24 2019-08-23 厦门快商通信息咨询有限公司 A kind of method for recognizing sound-groove, terminal, server and the storage medium of multi-model fusion
CN111785286A (en) * 2020-05-22 2020-10-16 南京邮电大学 Home CNN classification and feature matching combined voiceprint recognition method
CN112230555A (en) * 2020-10-12 2021-01-15 珠海格力电器股份有限公司 Intelligent household equipment, control method and device thereof and storage medium
CN112634869B (en) * 2020-12-09 2023-05-26 鹏城实验室 Command word recognition method, device and computer storage medium
CN112351047B (en) * 2021-01-07 2021-08-24 北京远鉴信息技术有限公司 Double-engine based voiceprint identity authentication method, device, equipment and storage medium
CN113241081B (en) * 2021-04-25 2023-06-16 华南理工大学 Far-field speaker authentication method and system based on gradient inversion layer
CN114464193A (en) * 2022-03-12 2022-05-10 云知声智能科技股份有限公司 Voiceprint clustering method and device, storage medium and electronic device

Also Published As

Publication number Publication date
CN115331673A (en) 2022-11-11

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
US20180158449A1 (en) Method and device for waking up via speech based on artificial intelligence
US20180005628A1 (en) Speech Recognition
CN109767765A (en) Talk about art matching process and device, storage medium, computer equipment
CN111785288B (en) Voice enhancement method, device, equipment and storage medium
CN105976812A (en) Voice identification method and equipment thereof
Eyben et al. Affect recognition in real-life acoustic conditions-a new perspective on feature selection
CN110544469B (en) Training method and device of voice recognition model, storage medium and electronic device
US11133022B2 (en) Method and device for audio recognition using sample audio and a voting matrix
CN111816170B (en) Training of audio classification model and garbage audio recognition method and device
CN110136726A (en) A kind of estimation method, device, system and the storage medium of voice gender
CN114127849A (en) Speech emotion recognition method and device
CN111344717A (en) Interactive behavior prediction method, intelligent device and computer-readable storage medium
CN117789699B (en) Speech recognition method, device, electronic equipment and computer readable storage medium
CN113436633B (en) Speaker recognition method, speaker recognition device, computer equipment and storage medium
CN115331673B (en) Voiceprint recognition household appliance control method and device in complex sound scene
CN116959464A (en) Training method of audio generation network, audio generation method and device
Bovbjerg et al. Self-Supervised Pretraining for Robust Personalized Voice Activity Detection in Adverse Conditions
CN115547345A (en) Voiceprint recognition model training and related recognition method, electronic device and storage medium
CN115221351A (en) Audio matching method and device, electronic equipment and computer-readable storage medium
CN112489678A (en) Scene recognition method and device based on channel characteristics
CN114333840A (en) Voice identification method and related device, electronic equipment and storage medium
Aung et al. M-Diarization: A Myanmar Speaker Diarization using Multi-scale dynamic weights
CN113035230A (en) Authentication model training method and device and electronic equipment
US20230377560A1 (en) Speech tendency classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant