
CN117012204B - Defensive method for countermeasure sample of speaker recognition system - Google Patents

Defensive method for countermeasure sample of speaker recognition system

Info

Publication number
CN117012204B
CN117012204B CN202310918349.2A CN202310918349A CN117012204B
Authority
CN
China
Prior art keywords
benign
model
cyclegan
data
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310918349.2A
Other languages
Chinese (zh)
Other versions
CN117012204A (en
Inventor
徐洋
杨凌一
张思聪
谢晓尧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou Education University
Original Assignee
Guizhou Education University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou Education University filed Critical Guizhou Education University
Priority to CN202310918349.2A priority Critical patent/CN117012204B/en
Publication of CN117012204A publication Critical patent/CN117012204A/en
Application granted granted Critical
Publication of CN117012204B publication Critical patent/CN117012204B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/18Artificial neural networks; Connectionist approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217Validation; Performance evaluation; Active pattern learning techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/094Adversarial learning
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/04Training, enrolment or model building
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The invention discloses a defense method against adversarial examples for speaker recognition systems, comprising the following steps: (1) create the required data set; (2) construct the network model, building the final model CycleGAN-L2 from an improved CycleGAN-VC2 model; (3) train the model using decremental learning; (4) run performance tests on the test set with the trained model and defend against adversarial examples generated by CW2, MIM, ADA and FGSM. The invention takes CycleGAN-VC2 as the backbone network and adds both adversarial and benign samples to the training set, reducing the side effects of the defense method. Following the idea of decremental learning, benign samples are deleted during training to speed up model training, and the loss function is constrained with the L2 distance to encourage the model to select more features, thereby achieving defense against adversarial examples.

Description

Defensive method for countermeasure sample of speaker recognition system
Technical Field
The invention belongs to the technical field of voice systems, in particular to adversarial defense in speaker recognition, and more specifically relates to a defense method against adversarial examples for speaker recognition systems.
Background
Defending against audio adversarial examples is an important topic in adversarial defense; the effectiveness of the defense directly affects the reliability of identity authentication, forensic verification, and personalized services on smart devices. As audio adversarial attacks continue to evolve and strengthen, it becomes increasingly important to protect speaker recognition systems from malicious interference and attack. When defending against audio adversarial examples, it is critical to defend effectively without degrading accuracy on benign samples; this is essential to maintaining the accuracy and robustness of the speaker recognition system.
Security in the image domain has been widely studied. In the speech domain, however, and especially in speaker recognition systems, defenses against adversarial examples have not been fully explored. Security issues in speaker recognition are not negligible: if a speaker recognition system provides financial or privacy-related services and is not adequately secured, personal property and reputation can be severely compromised.
Adversarial defenses can be divided into two approaches: active defense and passive defense. Active defense uses adversarial examples for data augmentation, retraining the speaker recognition model to improve its robustness. Passive defense adds new components without modifying the original model; according to the function of those components, passive methods can be classified into detection methods and purification methods, which respectively detect and eliminate the influence of adversarial examples.
The patent application with application number 202310123820.9 discloses a universal detection system and method for adversarial examples against speaker recognition systems. The system comprises a multi-channel audio interference module that applies audio perturbations to the input audio to generate a set of audio variants of the original audio; a speaker recognition module that feeds the generated variants into the speaker recognition system and extracts the corresponding score sequence and decision sequence; a stability feature extraction module that extracts statistical features from the score and decision sequences and concatenates them with the score sequence to obtain a stability representation; and a one-class decision module that judges from the stability representation whether the input audio is an adversarial example. A universal detection method is also disclosed. The system can adaptively detect adversarial attacks under various conditions, enhancing the security of voice recognition.
The patent application with application number 202210659947.8 discloses a voiceprint recognition adversarial example detection method based on differing transferability and decision-boundary attacks. It first preprocesses the speaker signals and divides them into a training set and a test set; builds a voiceprint recognition model from the training data; and generates adversarial examples on the target model with different attack methods. A mixed set of clean and adversarial samples is fed into the target model and a detection model to obtain two labels, which are compared for consistency: if inconsistent, the detection value (the adversarial perturbation proportion) is set to 0; if consistent, the sample whose label is unchanged is attacked with a decision-boundary attack to obtain its adversarial perturbation proportion. Decision-boundary attacks on a clean sample set yield a batch of perturbation proportions, from which a detection threshold is determined. Adversarial examples are then detected with this threshold: if a sample's perturbation proportion exceeds the threshold it is a clean sample, otherwise it is an adversarial example.
Both of the above patents are detection methods. The universal detection system preprocesses the audio under test to enrich its variants, feeds them to the recognition system to obtain score and decision sequences, and extracts features from these sequences for detection. The transferability-based method trains two speaker recognition models; if their outputs disagree, an adversarial example is detected. Since some adversarial examples in the input may still go undetected, the HopSkipJumpAttack (HSJA) decision-boundary attack is applied to push them across the decision boundary, and they are detected by comparison with a decision threshold.
The present inventors devised a purification method different from the protection approaches of the above two patents and found no similar patent document.
Disclosure of Invention
The invention aims to provide a defense method against adversarial examples for speaker recognition systems, addressing the slow training of generative adversarial networks and the side effects of defenses. It takes CycleGAN-VC2 as the backbone network and adds both adversarial and benign samples to the training set, reducing the defense's side effects on benign samples; following the idea of decremental learning, benign samples are deleted during training to speed up model training, and the loss function is constrained with the L2 distance to encourage the model to select more features, thereby achieving defense against adversarial examples.
The technical scheme of the invention is as follows:
A defense method against adversarial examples for speaker recognition systems maintains the accuracy and robustness of the speaker recognition system by fusing decremental learning with an improved CycleGAN-VC2. First, benign and adversarial samples are both added to the data set used to train the generator, and decremental learning is integrated into the training process to delete benign data; second, CycleGAN-VC2 is improved by constraining its loss function with the L2 distance. The method comprises the following steps:
step 1, creating the required data set;
step 2, constructing the network model, building the final CycleGAN-L2 model from an improved CycleGAN-VC2 model;
step 3, training the model CycleGAN-L2 using decremental learning;
step 4, performing a performance test on the test set with the trained model, and defending against adversarial examples generated by CW2, MIM, ADA and FGSM.
The steps are specifically as follows:
step 1, obtain the Librispeech voice data set and randomly select 10 speakers from it, using 100 audio files per speaker as the benign data set; perform a PGD attack on the benign data set to generate 1000 adversarial examples as the adversarial data set; combine the benign and adversarial data sets into the natural data set required for the experiments, and divide the natural data set into a training set and a test set at a 9:1 ratio, where the ratio of benign to adversarial samples in both the training set and the test set is 1:1;
step 2, modify the cycle consistency loss L_cyc and the identity mapping loss L_id in the CycleGAN-VC2 model to obtain the CycleGAN-L2 model, with the specific formulas:

L_cyc = E_x[ ||G_ori→nat(G_nat→ori(x)) − x||_2 ] + E_y[ ||G_nat→ori(G_ori→nat(y)) − y||_2 ]
L_id = E_y[ ||G_nat→ori(y) − y||_2 ] + E_x[ ||G_ori→nat(x) − x||_2 ]

wherein G_nat→ori and G_ori→nat are generators; in the cycle consistency loss, G_nat→ori(x) generates benign data y from natural data x, and G_ori→nat(y) generates natural data x from benign data y; in the identity mapping loss, G_nat→ori(y) maps benign data y to itself, and G_ori→nat(x) maps natural data x to itself; the cycle consistency loss L_cyc and the identity mapping loss L_id are each constrained with the L2 distance;
step 3, train the CycleGAN-L2 model with decremental learning: during training, if G_nat→ori receives a benign sample as input and the benign sample it outputs leaves the accuracy of the x-vector speaker recognition model unchanged or lowers it, remove that benign data from the natural data set;
step 4, test the benign and adversarial samples of the test set separately, and generate 1000 adversarial examples each with CW2, MIM, ADA and FGSM to test the defensive effect.
The invention has the following characteristics:
1. For the speaker recognition system, the invention improves the CycleGAN-VC2 model by constraining the loss function with the L2 distance, encouraging the model to select more features during training and improving its learning performance.
2. For the speaker recognition system, the invention trains on the natural data set instead of an adversarial-only data set, so the model also learns the characteristics of benign data, reducing its side effects on benign samples.
3. For the speaker recognition system, the invention applies decremental learning during training, shrinking the training data as training proceeds and greatly reducing the time required to train the generative adversarial network.
Drawings
FIG. 1 is a business flow diagram of the present invention;
FIG. 2 is a primary training flow diagram of the present invention;
FIG. 3 is a secondary training flow diagram of the present invention;
FIG. 4 is a block diagram of a generator;
FIG. 5 is a block diagram of the discriminator;
FIG. 6 compares the effects of different loss functions when defending against PGD;
FIG. 7 shows the waveforms produced by different defenses;
FIG. 8 shows the spectrograms produced by different defenses.
Detailed Description
The invention is further described below with reference to the figures and embodiments.
Referring to FIGS. 1-5, a defense method against adversarial examples for speaker recognition systems maintains the accuracy and robustness of the speaker recognition system by fusing decremental learning with an improved CycleGAN-VC2, comprising the following steps:
step 1, creating the required data set;
step 2, constructing the network model, building the final CycleGAN-L2 model from an improved CycleGAN-VC2 model;
step 3, training the model using decremental learning;
step 4, performing a performance test on the test set with the trained model, and defending against adversarial examples generated by CW2, MIM, ADA and FGSM.
The method comprises the following specific steps:
step 1, obtain the Librispeech voice data set and randomly select 10 speakers from it, using 100 audio files per speaker as the benign data set; perform a PGD attack on the benign data set to generate 1000 adversarial examples as the adversarial data set; combine the benign and adversarial data sets into the natural data set required for the experiments, and divide the natural data set into a training set and a test set at a 9:1 ratio, where the ratio of benign to adversarial samples in both the training set and the test set is 1:1;
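The data-set construction of step 1 can be organized as in the following sketch; the file names, the stand-ins for the PGD-attacked copies, and the split helper are illustrative assumptions, not the patent's actual code:

```python
import random

def make_natural_dataset(n_speakers=10, files_per_speaker=100):
    """Assemble the 'natural' data set: 10 x 100 benign utterances plus
    one adversarial copy per file (the PGD attack itself is elided)."""
    benign = [(f"spk{s:02d}_utt{u:03d}.flac", "benign")
              for s in range(n_speakers) for u in range(files_per_speaker)]
    # stand-ins for the PGD-attacked versions of each benign file
    adversarial = [(path.replace(".flac", "_pgd.flac"), "adv")
                   for path, _ in benign]
    return benign, adversarial

def split_9_to_1(benign, adversarial, seed=0):
    """Split each class 9:1 separately so that both the training set and
    the test set keep the required 1:1 benign/adversarial ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for group in (benign, adversarial):
        items = group[:]
        rng.shuffle(items)
        cut = int(0.9 * len(items))
        train += items[:cut]
        test += items[cut:]
    return train, test

benign, adversarial = make_natural_dataset()
train, test = split_9_to_1(benign, adversarial)
```

With 1000 benign and 1000 adversarial files this yields 1800 training and 200 test samples, each half benign and half adversarial, matching the ratios stated above.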
step 2, modify the cycle consistency loss L_cyc and the identity mapping loss L_id in the CycleGAN-VC2 model to obtain the CycleGAN-L2 model, with the specific formulas:

L_cyc = E_x[ ||G_ori→nat(G_nat→ori(x)) − x||_2 ] + E_y[ ||G_nat→ori(G_ori→nat(y)) − y||_2 ]
L_id = E_y[ ||G_nat→ori(y) − y||_2 ] + E_x[ ||G_ori→nat(x) − x||_2 ]

wherein G_nat→ori and G_ori→nat are generators. In the cycle consistency loss, G_nat→ori(x) generates benign data y from natural data x, and G_ori→nat(y) generates natural data x from benign data y; in the identity mapping loss, G_nat→ori(y) maps benign data y to itself, and G_ori→nat(x) maps natural data x to itself. The cycle consistency loss L_cyc and the identity mapping loss L_id are each constrained with the L2 distance.
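As a numerical illustration of the two L2-constrained losses (the generator callables, batch shapes, and mean-squared formulation are assumptions made for this sketch, not the patent's exact implementation):

```python
import numpy as np

def l2(a, b):
    # mean squared L2 distance between two batches of features
    return float(np.mean((a - b) ** 2))

def cycle_consistency_loss(g_nat2ori, g_ori2nat, x_nat, y_ori):
    """L_cyc: a round trip through both generators should return
    each sample to itself, measured with the L2 distance."""
    return (l2(g_ori2nat(g_nat2ori(x_nat)), x_nat)
            + l2(g_nat2ori(g_ori2nat(y_ori)), y_ori))

def identity_loss(g_nat2ori, g_ori2nat, x_nat, y_ori):
    """L_id: a generator fed data already in its target domain
    should leave it unchanged, again under the L2 distance."""
    return l2(g_nat2ori(y_ori), y_ori) + l2(g_ori2nat(x_nat), x_nat)
```

With identity generators both losses are exactly zero; any deviation is penalized quadratically, which is what encourages the generator to match more features than the L1 distance used in the original CycleGAN-VC2.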
Step 3, train the CycleGAN-L2 model with decremental learning: if, during training, G_nat→ori receives a benign sample as input and the benign sample it outputs leaves the accuracy of the x-vector speaker recognition model unchanged or lowers it, that benign data is removed from the natural data set.
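One way to read this pruning criterion, sketched with a hypothetical generator and accuracy callable (both are assumptions for illustration):

```python
def decremental_prune(train_set, g_nat2ori, accuracy_fn, baseline_acc):
    """Drop a benign sample from the natural training set when passing it
    through G_nat->ori leaves the x-vector accuracy unchanged or lower,
    i.e. keeping it no longer improves the generator."""
    kept = []
    for sample, label in train_set:
        if label == "benign" and accuracy_fn(g_nat2ori(sample)) <= baseline_acc:
            continue  # accuracy unchanged or declined: remove this sample
        kept.append((sample, label))
    return kept
```

Adversarial samples are never pruned; only benign data is removed as training progresses, which is what shrinks the training set and reduces the time needed to train the generative adversarial network.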
Step 4, test the benign and adversarial samples of the test set separately, and generate 1000 adversarial examples each with CW2, MIM, ADA and FGSM to test the defensive effect.
In CycleGAN-VC2, the L1 distance is used for model training. Considering that a speaker recognition scenario involves many speakers, the invention partially modifies the CycleGAN-VC2 model and constrains the loss function with the L2 distance, encouraging the generator to learn more features.
Referring to FIG. 6, to verify the effectiveness of the L2 loss function in the invention, the defensive effects of the two loss variants, CycleGAN-L1 and CycleGAN-L2, were compared on the test set. Here CSI denotes an untargeted attack in closed-set identification, OSI-simple a simple targeted attack in open-set identification, and CSI-hard a hard targeted attack in closed-set identification. As FIG. 6 shows, the accuracy acc_adv of the target model under CycleGAN-L2 is better than under CycleGAN-L1 across the different speaker recognition tasks, verifying the effectiveness of the L2 distance for the invention.
In choosing a defense based on generative adversarial networks, the invention aims to reduce the side effects of the CycleGAN-L2 model on benign samples while retaining a defensive effect against multiple attack types. To this end, the generator G_nat→ori is fed in two passes: the first input is real data comprising both adversarial and benign samples, and the second input is benign samples only, minimizing the adverse effect of the CycleGAN-L2 model on benign samples.
In Tables 1 and 2, acc_ben and acc_adv denote the accuracy of the speaker recognition model on benign samples and on adversarial samples, respectively. The defense primarily targets the following untargeted attacks: FGSM (Fast Gradient Sign Method), a fast gradient sign attack; MIM (Momentum Iterative Fast Gradient Sign Method), a gradient-based momentum iterative attack; PGD (Projected Gradient Descent), a projected gradient descent attack; CW2 (Carlini & Wagner), an optimization-based attack; and ADA (A Highly Stealthy Adaptive Decay Attack), a highly stealthy adaptive attack.
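For context, the gradient-based attacks above reduce to a few lines each; the epsilon, step size, and gradient callable below are illustrative assumptions, not the settings used in the experiments:

```python
import numpy as np

def fgsm(x, grad, eps=0.002):
    """FGSM: a single step of size eps in the direction of the sign of
    the loss gradient with respect to the input audio x."""
    return x + eps * np.sign(grad)

def pgd(x, grad_fn, eps=0.002, alpha=0.0005, steps=10):
    """PGD: iterated FGSM steps, each projected back into the eps-ball
    around the clean audio (MIM would additionally keep a momentum
    term on the gradient)."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project into eps-ball
    return x_adv
```

The projection step is what bounds the perturbation, so a PGD adversarial example never deviates from the clean waveform by more than eps per sample.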
QT (Quantization), AS (Average Smoothing) and MS (Median Smoothing) are time-domain methods that defend by quantization, average smoothing and median smoothing, respectively. DS (Down Sampling), LPF (Low Pass Filter) and BPF (Band Pass Filter) are frequency-domain methods that defend by downsampling, low-pass filtering and band-pass filtering, respectively. OPUS and SPEEX are speech-compression methods built on different compression algorithms. CycleGAN-L2 and CycleGAN-L1 are speech-synthesis methods, CycleGAN-L2 being an improvement over CycleGAN-L1.
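The time-domain baselines can be sketched directly on a waveform array; the kernel sizes and quantization level here are illustrative assumptions, not the settings used in the comparison:

```python
import numpy as np

def quantize(x, levels=256):
    """QT: snap samples to a coarse grid, discarding the fine-grained
    perturbation an adversarial example relies on."""
    return np.round(x * levels) / levels

def average_smooth(x, k=3):
    """AS: moving-average filter over the waveform."""
    return np.convolve(x, np.ones(k) / k, mode="same")

def median_smooth(x, k=3):
    """MS: sliding median filter over the waveform."""
    pad = k // 2
    xp = np.pad(x, pad, mode="edge")
    return np.array([np.median(xp[i:i + k]) for i in range(len(x))])
```

All three operate sample-by-sample on the waveform, which is why they are cheap to deploy but, as the tables suggest, less selective than a learned purification model.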
As Table 1 shows, in the closed-set identification task, whether L1 or L2 is used, adding adversarial samples to the training data alongside benign samples keeps acc_ben at 99.9%. Compared with QT, AS, MS, DS, LPF, BPF, OPUS and SPEEX, the CycleGAN-L2 defense dominates in acc_ben. When defending against the other attack methods, CYC-L2 also outperforms the other defenses, with acc_adv of 94.7%, 35.5%, 75.1%, 99.6% and 88.5%, respectively. Only against ADA is CYC-L2 slightly less effective than QT, by just 3.2%.
TABLE 1
TABLE 2
As Table 2 shows, in the open-set identification task, acc_ben with the L2 method is 97.7%, higher than with the L1 method, indicating that L2 outperforms L1 and has minimal side effects on benign samples. CYC-L2 also defends against the other attacks better than the other methods; for example, model accuracy reaches 88.3% when CYC-L2 defends against FGSM. The acc_adv of CYC-L2 differs from that of CYC-L1, QT, AS and LPF by 1.1%, 12.3%, 40.3% and 38.2%, respectively.
The invention provides a defense method against adversarial examples for speaker recognition systems. The model, named CycleGAN-L2, constrains its loss function with the L2 distance to encourage the selection of more features during training, further improving the training effect, and introduces decremental learning, greatly reducing the time required to train the generative adversarial network. To reduce side effects on benign samples, both adversarial and benign samples are added to the training set. Experimental results show that in the closed-set and open-set identification tasks, acc_ben reaches 99.9% and 97.7% respectively, with minimal impact on benign samples. When defending against FGSM, MIM, PGD, CW2 and ADA in open-set identification, acc_adv is better than with other methods, showing resistance to different attacks. FIGS. 7 and 8 visualize the different defense methods against MIM attacks.
In conclusion, the invention converts adversarial examples into benign samples using a generative adversarial network, adds benign samples to the data set so that the target model's recognition accuracy on benign samples is not affected, and uses decremental learning to greatly reduce the model's training time. The method can be deployed on any speaker recognition model.
The foregoing description is only a preferred embodiment of the present invention and is not intended to limit the invention in any way; any simple modification, equivalent change or variation of the above embodiment according to the technical substance of the invention still falls within the scope of the technical scheme of the invention.

Claims (1)

1. A method of defending against adversarial examples for a speaker recognition system, characterized in that: the accuracy and robustness of the speaker recognition system are maintained by fusing decremental learning with an improved CycleGAN-VC2; first, benign and adversarial samples are both added to the data set used to train the generator, and decremental learning is integrated into the training process to delete benign data; second, CycleGAN-VC2 is improved by constraining its loss function with the L2 distance; the method comprises the following steps:
step 1, creating the required data set;
step 2, constructing the network model, building the final CycleGAN-L2 model from an improved CycleGAN-VC2 model;
step 3, training the model CycleGAN-L2 using decremental learning;
step 4, performing a performance test on the test set with the trained model, and defending against adversarial examples generated by CW2, MIM, ADA and FGSM;
step 1 specifically comprises: obtaining the Librispeech voice data set and randomly selecting 10 speakers from it, using 100 audio files per speaker as the benign data set; performing a PGD attack on the benign data set to generate 1000 adversarial examples as the adversarial data set; combining the benign and adversarial data sets into the natural data set required for the experiments; and dividing the natural data set into a training set and a test set at a 9:1 ratio, where the ratio of benign to adversarial samples in both the training set and the test set is 1:1;
step 2 specifically comprises: modifying the cycle consistency loss L_cyc and the identity mapping loss L_id in the CycleGAN-VC2 model to obtain the CycleGAN-L2 model, with the specific formulas:

L_cyc = E_x[ ||G_ori→nat(G_nat→ori(x)) − x||_2 ] + E_y[ ||G_nat→ori(G_ori→nat(y)) − y||_2 ]
L_id = E_y[ ||G_nat→ori(y) − y||_2 ] + E_x[ ||G_ori→nat(x) − x||_2 ]

wherein G_nat→ori and G_ori→nat are generators; in the cycle consistency loss, G_nat→ori(x) generates benign data y from natural data x, and G_ori→nat(y) generates natural data x from benign data y; in the identity mapping loss, G_nat→ori(y) maps benign data y to itself, and G_ori→nat(x) maps natural data x to itself; the cycle consistency loss L_cyc and the identity mapping loss L_id are each constrained with the L2 distance;
the step 3 is specifically that a decrement learning method is adopted to train the CycleGAN-L2 model, and if G is generated in the training process nat→ori The input is benign sample, the benign sample that outputs makes the accuracy of speaker identification model x-vector unchanged or decline then remove the benign data in the natural dataset;
the step 4 specifically includes separately testing benign samples and challenge samples of the test set, and generating 1000 challenge samples by using CW2, MIM, ADA and FGSM, respectively, to perform a defensive effect test.
CN202310918349.2A 2023-07-25 2023-07-25 Defensive method for countermeasure sample of speaker recognition system Active CN117012204B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310918349.2A CN117012204B (en) 2023-07-25 2023-07-25 Defensive method for countermeasure sample of speaker recognition system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310918349.2A CN117012204B (en) 2023-07-25 2023-07-25 Defensive method for countermeasure sample of speaker recognition system

Publications (2)

Publication Number Publication Date
CN117012204A CN117012204A (en) 2023-11-07
CN117012204B true CN117012204B (en) 2024-04-09

Family

ID=88566646

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310918349.2A Active CN117012204B (en) 2023-07-25 2023-07-25 Defensive method for countermeasure sample of speaker recognition system

Country Status (1)

Country Link
CN (1) CN117012204B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117292690B (en) * 2023-11-24 2024-03-15 南京信息工程大学 Voice conversion active defense method, device, system and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110767216A (en) * 2019-09-10 2020-02-07 浙江工业大学 Voice recognition attack defense method based on PSO algorithm
CN111627429A (en) * 2020-05-20 2020-09-04 浙江工业大学 Defense method and device of voice recognition model based on cycleGAN
WO2021169292A1 (en) * 2020-02-24 2021-09-02 上海理工大学 Adversarial optimization method for training process of generative adversarial neural network
WO2021205746A1 (en) * 2020-04-09 2021-10-14 Mitsubishi Electric Corporation System and method for detecting adversarial attacks
CN115188384A (en) * 2022-06-09 2022-10-14 浙江工业大学 Voiceprint recognition countermeasure sample defense method based on cosine similarity and voice denoising
CN115309897A (en) * 2022-07-27 2022-11-08 方盈金泰科技(北京)有限公司 Chinese multi-modal confrontation sample defense method based on confrontation training and contrast learning
CN116013318A (en) * 2022-12-13 2023-04-25 浙江大学 Countermeasure sample construction method for voiceprint recognition defense module

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7259981B2 (en) * 2019-10-17 2023-04-18 日本電気株式会社 Speaker authentication system, method and program
CN113052203B (en) * 2021-02-09 2022-01-18 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Anomaly detection method and device for multiple types of data

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110767216A (en) * 2019-09-10 2020-02-07 浙江工业大学 Voice recognition attack defense method based on PSO algorithm
WO2021169292A1 (en) * 2020-02-24 2021-09-02 上海理工大学 Adversarial optimization method for training process of generative adversarial neural network
WO2021205746A1 (en) * 2020-04-09 2021-10-14 Mitsubishi Electric Corporation System and method for detecting adversarial attacks
CN111627429A (en) * 2020-05-20 2020-09-04 浙江工业大学 Defense method and device of voice recognition model based on cycleGAN
CN115188384A (en) * 2022-06-09 2022-10-14 浙江工业大学 Voiceprint recognition countermeasure sample defense method based on cosine similarity and voice denoising
CN115309897A (en) * 2022-07-27 2022-11-08 方盈金泰科技(北京)有限公司 Chinese multi-modal confrontation sample defense method based on confrontation training and contrast learning
CN116013318A (en) * 2022-12-13 2023-04-25 浙江大学 Countermeasure sample construction method for voiceprint recognition defense module

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Takuhiro Kaneko, "CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion," ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, 2019-05-17, pp. 6820-6823 *
Yan Fei, Zhang Minglun, Zhang Liqiang, "Adversarial example detection method based on boundary-value invariants," Chinese Journal of Network and Information Security, 2020-02-15 (No. 01), pp. 1-3 *

Also Published As

Publication number Publication date
CN117012204A (en) 2023-11-07

Similar Documents

Publication Publication Date Title
CN113554089B (en) Image classification countermeasure sample defense method and system and data processing terminal
CN109599109B (en) Confrontation audio generation method and system for white-box scene
Chen et al. Robust deep feature for spoofing detection—The SJTU system for ASVspoof 2015 challenge
CN112883874B (en) Active defense method aiming at deep face tampering
CN117012204B (en) Defensive method for countermeasure sample of speaker recognition system
CN109887496A (en) Orientation confrontation audio generation method and system under a kind of black box scene
CN112287323B (en) Voice verification code generation method based on generation of countermeasure network
CN111881446B (en) Industrial Internet malicious code identification method and device
Peng et al. Pairing Weak with Strong: Twin Models for Defending Against Adversarial Attack on Speaker Verification.
CN115147682B (en) Method and device for generating hidden white box countermeasure sample with mobility
CN114640518B (en) Personalized trigger back door attack method based on audio steganography
Panariello et al. Malafide: a novel adversarial convolutive noise attack against deepfake and spoofing detection systems
Wang et al. ADDITION: Detecting Adversarial Examples With Image-Dependent Noise Reduction
CN113222120B (en) Neural network back door injection method based on discrete Fourier transform
Liu et al. Detecting adversarial audio via activation quantization error
CN113113023A (en) Black box directional anti-attack method and system for automatic voiceprint recognition system
Kaushal et al. The societal impact of Deepfakes: Advances in Detection and Mitigation
Kawa et al. Defense against adversarial attacks on audio deepfake detection
CN116013318A (en) Countermeasure sample construction method for voiceprint recognition defense module
CN116309031A (en) Face counterfeiting active interference method, system, equipment and storage medium
CN112289324B (en) Voiceprint identity recognition method and device and electronic equipment
CN118522290B (en) Voice countermeasure sample generation method and device, electronic equipment and storage medium
CN111353403A (en) Method and system for detecting confrontation sample of deep neural network image
CN113987955B (en) Antagonistic sample defense method based on trap type integrated network
Mahfuz et al. Ensemble noise simulation to handle uncertainty about gradient-based adversarial attacks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant