
CN115188384A - Voiceprint recognition countermeasure sample defense method based on cosine similarity and voice denoising - Google Patents

Voiceprint recognition countermeasure sample defense method based on cosine similarity and voice denoising Download PDF

Info

Publication number
CN115188384A
Authority
CN
China
Prior art keywords
sample
voice
cosine similarity
denoising
threshold
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210653117.4A
Other languages
Chinese (zh)
Inventor
Xu Dongwei (徐东伟)
Jiang Bin (蒋斌)
Fang Ruochen (房若尘)
Xuan Qi (宣琦)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202210653117.4A priority Critical patent/CN115188384A/en
Publication of CN115188384A publication Critical patent/CN115188384A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/20 - Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L17/04 - Training, enrolment or model building
    • G10L17/06 - Decision making techniques; Pattern matching strategies
    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • G10L17/22 - Interactive procedures; Man-machine interfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Complex Calculations (AREA)

Abstract

A voiceprint recognition adversarial sample defense method based on cosine similarity and voice denoising: first, data preprocessing is carried out on the speaker voice data; a voiceprint recognition model is built; an adversarial sample generator carrying malicious information is designed by combining the voiceprint recognition model with several different adversarial attack methods; wavelet-transform reconstruction is performed on clean data, the cosine similarity between the output probability vectors produced by the classification model for samples before and after the wavelet transform is calculated, and a cosine similarity threshold is set; wavelet-transform reconstruction is then performed on the adversarial samples, and the cosine similarity between the output vectors before and after reconstruction is calculated and compared with the threshold: a sample whose similarity is below the threshold is judged adversarial, while a sample above the threshold goes undetected; finally, a voice denoising neural network is trained, and the undetected adversarial samples are input into the denoising network to remove the adversarial perturbation.

Description

Voiceprint recognition countermeasure sample defense method based on cosine similarity and voice denoising
Technical Field
The invention relates to a method for defending voiceprint recognition against adversarial samples, and belongs to the field of deep learning security.
Background
With its rapid development, deep learning has become one of the most widely used technologies in artificial intelligence, influencing and changing people's lives in many respects; typical applications include smart homes, intelligent driving, speech recognition and voiceprint recognition. As a very complex software system, a deep learning system can also face various hacking attacks, which may threaten property security, personal privacy, traffic safety and public safety. Attacks against deep learning systems typically include the following. 1. Model stealing: a hacker steals the model file deployed on the server by various advanced means. 2. Data poisoning: abnormal data are injected into the deep learning training samples so that the model misclassifies when certain conditions are met; for example, a backdoor attack algorithm adds a backdoor mark to the poisoned data so that the model is poisoned. 3. Adversarial samples: an adversarial sample is an input formed by deliberately adding a subtle perturbation to the data, causing the model to give an erroneous output with high confidence. In brief, an adversarial sample makes the deep learning model misclassify by superimposing a well-constructed, human-imperceptible perturbation on the original data. The security of deep learning has become a problem that urgently needs to be solved today.
Defense methods are mainly divided into two categories: adversarial sample defense and adversarial sample detection. The main purpose of adversarial sample defense is to restore the classification label of the adversarial sample to the label of the normal sample; the main purpose of adversarial sample detection is to find the adversarial samples in a sample set and reject them.
Disclosure of Invention
The invention provides a voiceprint recognition adversarial sample defense method based on cosine similarity and voice denoising, which aims to overcome the defects in the prior art.
The technical method adopted by the invention to solve the technical problem is as follows: data preprocessing is carried out on the speaker voice data used; a voiceprint recognition model is built; an adversarial sample generator carrying malicious information is designed by combining the voiceprint recognition model with several different adversarial attack methods; wavelet-transform reconstruction is performed on clean data, the cosine similarity between the output probability vectors produced by the classification model for samples before and after the wavelet transform is calculated, and a cosine similarity threshold is set; wavelet-transform reconstruction is then performed on the adversarial samples, and the cosine similarity between the output vectors before and after reconstruction is calculated and compared with the threshold: a sample whose similarity is below the threshold is judged adversarial, while a sample above the threshold goes undetected; finally, a voice denoising neural network is trained, and the undetected adversarial samples are input into the denoising network to remove the adversarial perturbation.
A voiceprint recognition adversarial sample defense method based on cosine similarity and voice denoising comprises the following steps:
Step 1: carry out data preprocessing on the speaker voice signals;
Step 2: build a voiceprint recognition model;
Step 3: design adversarial samples according to the voiceprint model;
Step 4: obtain a detection threshold from clean data by a wavelet-transform reconstruction method;
Step 5: detect adversarial samples according to the decision threshold: apply the wavelet transform to a sample X_i^adv, obtain the cosine similarity value C'_i between the model outputs before and after reconstruction, and compare it with the decision threshold T; if C'_i < T, X_i^adv is judged to be an adversarial sample;
Step 6: defend the adversarial samples with the denoising neural network.
Further, step 1 specifically includes:
First, data extraction is performed on the existing voice files (WAV format); the speaker voice data are extracted with the librosa audio-processing Python toolkit:
X_i, sr = librosa.load(T_i, sr=None), i = 1, 2, ..., n+m   (1)
where X_i is the voice data extracted from the i-th speaker voice file, sr is the sampling rate of the voice data, and T_i is the i-th speaker voice file.
The speech signal data set is normalized, and the data set D is divided into a training set D_train and a test set D_test, where
D_train = {(X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)}
D_test = {(X_{n+1}, Y_{n+1}), (X_{n+2}, Y_{n+2}), ..., (X_{n+m}, Y_{n+m})}
X_i = (X_{i1}, X_{i2}, ..., X_{id}), where d denotes the data length of X_i and Y_i ∈ {1, 2, ..., c} denotes the class label among c classes. The normalization formula is:
X̄_i = X_i / max(X_i)   (2)
where X̄_i denotes the normalized sample and max(X_i) denotes the maximum value of the d sampling points in the sample, so that the normalized X̄_i lies in [-1, 1].
Further, step 2 specifically includes:
predesignating the structure and parameters of the classification model, and not changing; the classification model structure adopted by the invention mainly comprises a 1D convolution layer, a maximum pooling layer, a batch normalization layer and a full-connection layer. The specific structure is shown in table 1, training is performed by using a training data set, and the voiceprint recognition classification model is as follows:
target model:
Figure BDA0003686632470000041
F target (. O) a probability vector representing the model output;
Further, step 3 specifically includes:
The adversarial sample is defined as:
X_i^adv = X_i + δ_i   (4)
where δ_i is the perturbation added to the original sample.
The generation of adversarial samples is described taking an optimization-based adversarial attack method as an example; optimization-based attack methods are in essence gradient-based adversarial sample generation methods.
The optimization function is defined as:
min_{δ_i} ||δ_i||   (5)
s.t. F_target(X_i + δ_i) ≠ Y_i   (6)
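Among the white-box attacks named later in the description (FGSM, BIM, PGD, DeepFool, CW), FGSM is the simplest gradient-based instance; a hedged PyTorch sketch follows, in which model, x, y and the step size eps are illustrative assumptions rather than the patent's parameters.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=1e-3):
    """One-step gradient-sign attack: X_adv = X + delta_i, with
    delta_i = eps * sign(grad_x L(F_target(x), y)) (cf. eq. 4)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)   # model(x): logits of the voiceprint classifier
    loss.backward()
    delta = eps * x.grad.sign()           # the perturbation delta_i
    return (x + delta).detach()
```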
Further, step 4 specifically includes:
A detection threshold is obtained from the clean data by a wavelet-transform reconstruction method.
The perturbation e_i(t) is suppressed to recover the true signal f_i(t): after the wavelet transform, the correlation within the signal f_i(t) is removed to the greatest possible extent and most of its energy is concentrated on a few wavelet coefficients of large amplitude, whereas the perturbation e_i(t) is spread over the whole time axis at every scale after the wavelet transform and its amplitudes are not large. The voice signal can therefore be denoised by threshold filtering.
The wavelet threshold denoising steps are as follows:
I. Perform the wavelet transform on the speech signal f_i(t): select an orthogonal wavelet and the number of decomposition levels N, and perform an N-level wavelet decomposition of f_i(t).
II. Apply threshold processing to the wavelet transform coefficients of the sample signal f_i(t): the high-frequency coefficients of each level, from the first level to the N-th level, are processed through a threshold function, while the low-frequency coefficients of each level are left unprocessed. The threshold formula (soft thresholding) is:
ŵ = sgn(w)(|w| - λ) if |w| ≥ λ, and ŵ = 0 if |w| < λ   (7)
where w is a wavelet coefficient and λ is the selected threshold.
III. Reconstruct the processed wavelet coefficients: the voice signal is reconstructed from the low-frequency coefficients of the N-th decomposition level and the processed high-frequency coefficients of levels 1 to N, giving the denoised voice signal f_i^de(t), with
f_i^de(t) ≈ f_i(t)   (8)
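Steps I-III can be sketched with the PyWavelets package; the 'db4' wavelet and N = 4 levels are assumptions chosen for illustration (the method only requires an orthogonal wavelet and N decomposition levels), while λ = 0.02 matches the value used in Example 2.

```python
import numpy as np
import pywt

def wavelet_denoise(f, wavelet="db4", N=4, lam=0.02):
    """N-level wavelet decomposition, soft-threshold the high-frequency
    (detail) coefficients (eq. 7), keep the low-frequency (approximation)
    coefficients, and reconstruct the denoised signal f_de (step III)."""
    coeffs = pywt.wavedec(f, wavelet, level=N)   # [cA_N, cD_N, ..., cD_1]
    kept = [coeffs[0]]                           # low-frequency coefficients untouched
    kept += [pywt.threshold(d, lam, mode="soft") for d in coeffs[1:]]
    f_de = pywt.waverec(kept, wavelet)
    return f_de[: len(f)]                        # waverec may append one padding sample
```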
The original clean sample X_i and the clean sample after the wavelet transform X_i^de are input into the voiceprint recognition model to obtain the corresponding output probability vectors F_target(X_i) and F_target(X_i^de); the cosine similarity value between the two vectors is then calculated as
C_i = (F_target(X_i) · F_target(X_i^de)) / (||F_target(X_i)|| · ||F_target(X_i^de)||)   (9)
A decision threshold is then selected from the set C of cosine similarity values of the clean samples: if the false-detection rate is set to b%, the decision threshold T is the value at the b-th percentile of sort(C), where sort(C) denotes C sorted in ascending order.
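The threshold selection can be written down directly; in this sketch, model is assumed to return the output probability vector F_target(·), wavelet_denoise is the helper sketched above, and b is the tolerated false-detection rate in percent.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two probability vectors (eq. 9)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def decision_threshold(model, clean_samples, b=5.0):
    """T is the value at the b-th percentile of the ascending-sorted cosine
    similarities of clean samples, so about b% of clean samples are
    (falsely) flagged as adversarial."""
    C = [cosine(model(x), model(wavelet_denoise(x))) for x in clean_samples]
    return float(np.percentile(C, b))
```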
Further, step 6 specifically includes:
First, the denoising neural network is trained. The denoising neural network adopted is DCCRN. DCCRN first transforms the noisy speech by the short-time Fourier transform to obtain a complex speech spectrum with a real part and an imaginary part. Define the input complex matrix I = I_r + jI_i and the complex convolution filter W = W_r + jW_i, where the matrices W_r and W_i represent the real and imaginary parts of the complex convolution kernel; the output feature F_out of the complex layer is:
F_out = (I_r * W_r - I_i * W_i) + j(I_r * W_i + I_i * W_r)   (10)
Similarly to the complex convolution, given the real and imaginary parts X_r and X_i of a complex input, the output L_out of the complex LSTM can be defined as:
L_rr = LSTM_r(X_r); L_ii = LSTM_i(X_i)   (11)
L_ri = LSTM_i(X_r); L_ir = LSTM_r(X_i)   (12)
L_out = (L_rr - L_ii) + j(L_ri + L_ir)   (13)
where LSTM_r and LSTM_i are two conventional LSTM modules for the real and imaginary parts.
During network training, the purpose of DCCRN is to optimize a complex mask matrix M = M_r + jM_i; applying the mask matrix to the noisy speech gives the denoised speech:
S = M * Y   (14)
where S = S_r + jS_i is the complex spectrum of clean speech and Y = Y_r + jY_i is the complex spectrum of noisy speech.
Given S and Y, M can be calculated by the following formula:
M = S / Y = (S_r + jS_i) / (Y_r + jY_i) = ((S_r·Y_r + S_i·Y_i) + j(S_i·Y_r - S_r·Y_i)) / (Y_r² + Y_i²)   (15)
M is expressed in polar coordinates as:
M = M_mag · e^(j·M_phase)   (16)
M_mag = √(M_r² + M_i²)   (17)
M_phase = arctan2(M_i, M_r)   (18)
With this, the enhanced speech can be estimated from the noisy speech as:
S̃ = M_mag · Y_mag · e^(j·(M_phase + Y_phase))   (19)
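Equations (15)-(19) can be checked numerically with a few lines of NumPy; S and Y here stand for single complex STFT bins of clean and noisy speech and are illustrative values only.

```python
import numpy as np

def dccrn_e_estimate(S, Y):
    """Ideal complex ratio mask M = S / Y (eq. 15), its polar form
    (eqs. 16-18), and the magnitude/phase estimate of eq. (19)."""
    M = S / Y                                  # complex division realizes eq. (15)
    M_mag, M_phase = np.abs(M), np.angle(M)    # eqs. (17) and (18)
    Y_mag, Y_phase = np.abs(Y), np.angle(Y)
    return M_mag * Y_mag * np.exp(1j * (M_phase + Y_phase))   # eq. (19)

# with the ideal mask, the estimate recovers the clean bin exactly:
# dccrn_e_estimate(0.3 + 0.4j, 1.0 - 0.5j) ~= 0.3 + 0.4j
```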
The loss function of DCCRN is SI-SNR, calculated as follows:
s_target = (⟨s̃, s⟩ · s) / ||s||²   (20)
e_noise = s̃ - s_target   (21)
SI-SNR = 10 · log₁₀(||s_target||² / ||e_noise||²)   (22)
where ⟨·,·⟩ denotes the dot product between two vectors; when the estimated speech s̃ is very close to the clean speech s, e_noise ≈ 0.
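A minimal NumPy sketch of the SI-SNR computation of eqs. (20)-(22); during training, the negative SI-SNR would serve as the loss to be minimized.

```python
import numpy as np

def si_snr(s_est, s):
    """Scale-invariant signal-to-noise ratio in dB (eqs. 20-22)."""
    s_target = np.dot(s_est, s) * s / np.dot(s, s)                    # eq. (20)
    e_noise = s_est - s_target                                        # eq. (21)
    return 10.0 * np.log10(np.sum(s_target**2) / np.sum(e_noise**2))  # eq. (22)
```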
After the DCCRN denoising network is trained, the undetected adversarial samples are input into the denoising network to obtain denoised voice, and the denoised voice is restored to its original correct label.
The working principle of the invention is as follows:
Data preprocessing of the speech data set: the original time-domain waveform data of each segment of voice are acquired, divided into a training set and a test set, and normalized.
Building the voiceprint recognition model: the structure and parameters of the voiceprint recognition model are specified in advance and are not changed thereafter. A data set suited to the recognition model is also given, namely speaker voice samples comprising the input time-domain waveform data used for speaker recognition and the corresponding classification labels; the model predicts the sample set in this data set with high accuracy.
Designing adversarial samples according to the voiceprint recognition model: several commonly used white-box adversarial attack methods are chosen. The input data are adjusted along the gradient direction given by the parameters of the voiceprint recognition model, so that the voiceprint recognition model outputs a wrong label although the input sample changes only slightly.
Obtaining a detection threshold from clean data by the wavelet-transform reconstruction method: a clean sample is wavelet-decomposed and then wavelet-reconstructed, denoising the sample through the wavelet transform. The clean samples before and after the wavelet transform are input into the voiceprint recognition network to obtain two model output probability vectors, the cosine similarity between the corresponding vectors is calculated, and a threshold is set among these cosine similarity values, selected with the false-detection rate as the criterion.
Detecting adversarial samples according to the decision threshold: the wavelet transform is applied to a sample, the samples before and after the transform are input into the voiceprint recognition model to obtain the output probability vectors and their cosine similarity value, which is compared with the decision threshold; if it is smaller than the decision threshold, the sample is judged adversarial.
Defending adversarial samples with the denoising neural network: a denoising neural network is first trained; the adversarial samples not detected in the previous step are then input into the denoising network, finally yielding denoised samples, so that a large number of adversarial samples lose their adversarial effect, achieving further defense.
Adversarial samples differ from normal samples in that they carry more noise, so after denoising they change more than clean samples do. Reflected in the model, the output probability of an adversarial sample changes more before and after denoising. This change can be measured by the cosine similarity between the output vectors: a small change in the output vector gives a large cosine similarity value, and a large change gives a small one. Adversarial samples can be detected using this property. To achieve further defense, the adversarial samples that survive the detection step are additionally denoised, improving the defense effect. The invention enhances the security of the voiceprint recognition model.
The invention has the advantages that: the method can accurately detect the adversarial samples in the data and can further purify and defend the undetected adversarial samples, effectively reducing the risk brought by adversarial samples and enhancing the security of the voiceprint recognition model.
Drawings
FIG. 1 is a basic flow diagram of the process of the present invention.
FIG. 2 is a flow chart of the wavelet transform denoising of the present invention.
Fig. 3 is a diagram illustrating a structure of a DCCRN network according to the present invention.
Detailed description of embodiments:
the technical scheme of the invention is further explained by combining the attached drawings.
Example 1
A voiceprint recognition adversarial sample defense method based on cosine similarity and voice denoising comprises the following steps:
(1) Data preprocessing of the speaker voice signal:
First, data extraction is performed on the existing voice files (WAV format); the speaker voice data are extracted with the librosa audio-processing Python toolkit:
X_i, sr = librosa.load(T_i, sr=None), i = 1, 2, ..., n+m   (1)
where X_i is the voice data extracted from the i-th speaker voice file, sr is the sampling rate of the voice data, and T_i is the i-th speaker voice file.
The speech signal data set is normalized, and the data set D is divided into a training set D_train and a test set D_test, where
D_train = {(X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)}
D_test = {(X_{n+1}, Y_{n+1}), (X_{n+2}, Y_{n+2}), ..., (X_{n+m}, Y_{n+m})}
X_i = (X_{i1}, X_{i2}, ..., X_{id}), where d denotes the data length of X_i and Y_i ∈ {1, 2, ..., c} denotes the class label among c classes. The normalization formula is:
X̄_i = X_i / max(X_i)   (2)
where X̄_i denotes the normalized sample and max(X_i) denotes the maximum value of the d sampling points in the sample, so that the normalized X̄_i lies in [-1, 1].
(2) Building the voiceprint recognition model: the structure and parameters of the classification model are specified in advance and are not changed thereafter. The classification model structure adopted by the invention mainly comprises 1D convolution layers, max-pooling layers, batch-normalization layers and fully connected layers; the specific structure is shown in Table 1. Training is performed with the training data set, and the voiceprint recognition classification model is:
target model: Ŷ_i = F_target(X_i)   (3)
where F_target(·) denotes the probability vector output by the model.
(3) Designing adversarial samples according to the voiceprint recognition classifier:
The adversarial sample is defined as:
X_i^adv = X_i + δ_i   (4)
where δ_i is the perturbation added to the original sample.
The generation of adversarial samples is described here taking an optimization-based adversarial attack method as an example. Optimization-based attack methods are in essence gradient-based adversarial sample generation methods.
The optimization function is defined as:
min_{δ_i} ||δ_i||   (5)
s.t. F_target(X_i + δ_i) ≠ Y_i   (6)
(4) Calculating the cosine similarity of the clean samples according to the wavelet transform and selecting the decision threshold:
The perturbation e_i(t) is suppressed to recover the true signal f_i(t): after the wavelet transform, the correlation within the signal f_i(t) is removed to the greatest possible extent and most of its energy is concentrated on a few wavelet coefficients of large amplitude, whereas the perturbation e_i(t) is spread over the whole time axis at every scale after the wavelet transform and its amplitudes are not large. The voice signal can therefore be denoised by threshold filtering.
The wavelet threshold denoising steps are as follows:
I. Perform the wavelet transform on the speech signal f_i(t): select an orthogonal wavelet and the number of decomposition levels N, and perform an N-level wavelet decomposition of f_i(t).
II. Apply threshold processing to the wavelet transform coefficients of the sample signal f_i(t): the high-frequency coefficients of each level, from the first level to the N-th level, are processed through a threshold function, while the low-frequency coefficients of each level are left unprocessed. The threshold formula (soft thresholding) is:
ŵ = sgn(w)(|w| - λ) if |w| ≥ λ, and ŵ = 0 if |w| < λ   (7)
where w is a wavelet coefficient and λ is the selected threshold.
III. Reconstruct the processed wavelet coefficients: the voice signal is reconstructed from the low-frequency coefficients of the N-th decomposition level and the processed high-frequency coefficients of levels 1 to N, giving the denoised voice signal f_i^de(t), with
f_i^de(t) ≈ f_i(t)   (8)
The original clean sample X_i and the clean sample after the wavelet transform X_i^de are input into the voiceprint recognition model to obtain the corresponding output probability vectors F_target(X_i) and F_target(X_i^de); the cosine similarity value between the two vectors is then calculated as
C_i = (F_target(X_i) · F_target(X_i^de)) / (||F_target(X_i)|| · ||F_target(X_i^de)||)   (9)
A decision threshold is then selected from the set C of cosine similarity values of the clean samples: if the false-detection rate is set to b%, the decision threshold T is the value at the b-th percentile of sort(C), where sort(C) denotes C sorted in ascending order.
(5) Detecting adversarial samples according to the decision threshold: the wavelet transform is applied to a sample X_i^adv and the cosine similarity value C'_i between the model outputs before and after reconstruction is obtained; this value is compared with the decision threshold T, and if C'_i < T, X_i^adv is judged to be an adversarial sample.
(6) Denoising the undetected adversarial samples with the denoising neural network:
First, the denoising neural network is trained. The denoising neural network adopted by the invention is DCCRN. DCCRN first transforms the noisy speech by the short-time Fourier transform to obtain a complex speech spectrum with a real part and an imaginary part. Define the input complex matrix I = I_r + jI_i and the complex convolution filter W = W_r + jW_i, where the matrices W_r and W_i represent the real and imaginary parts of the complex convolution kernel; the output feature F_out of the complex layer is:
F_out = (I_r * W_r - I_i * W_i) + j(I_r * W_i + I_i * W_r)   (10)
Similarly to the complex convolution, given the real and imaginary parts X_r and X_i of a complex input, the output L_out of the complex LSTM can be defined as:
L_rr = LSTM_r(X_r); L_ii = LSTM_i(X_i)   (11)
L_ri = LSTM_i(X_r); L_ir = LSTM_r(X_i)   (12)
L_out = (L_rr - L_ii) + j(L_ri + L_ir)   (13)
where LSTM_r and LSTM_i are two conventional LSTM modules for the real and imaginary parts.
During network training, the purpose of DCCRN is to optimize a complex mask matrix M = M_r + jM_i; applying the mask matrix to the noisy speech gives the denoised speech:
S = M * Y   (14)
where S = S_r + jS_i is the complex spectrum of clean speech and Y = Y_r + jY_i is the complex spectrum of noisy speech.
Given S and Y, M can be calculated by the following formula:
Figure BDA0003686632470000131
m is expressed in polar coordinates as:
Figure BDA0003686632470000132
Figure BDA0003686632470000133
M phase =arctan2(M i ,M r ) (18)
with this noisy speech can be estimated as:
Figure BDA0003686632470000134
The loss function of DCCRN is SI-SNR, calculated as follows:
s_target = (⟨s̃, s⟩ · s) / ||s||²   (20)
e_noise = s̃ - s_target   (21)
SI-SNR = 10 · log₁₀(||s_target||² / ||e_noise||²)   (22)
where ⟨·,·⟩ denotes the dot product between two vectors; when the estimated speech s̃ is very close to the clean speech s, e_noise ≈ 0.
After the DCCRN denoising network is trained, the undetected adversarial samples are input into the denoising network to obtain denoised voice, and the denoised voice is restored to its original correct label.
Example 2: data in actual experiments
(1) Selection of experimental data.
The data set used in the experiment is the AISHELL-1 voice data set, which collects voices recorded in a quiet environment by speakers of different age groups, genders and regions, at a sampling rate of 16000 Hz. We selected the voices of 20 speakers as the data set of the voiceprint recognition model, and the time-domain data length of the original waveform extracted for each sentence of voice was 60000. The data are preprocessed and stored as arrays of shape (batchsize, 60000, 1), the corresponding label data are generated, and the processed data sets are saved as npy files. The inputs of the denoising network DCCRN are the adversarial samples generated by FGSM, BIM, PGD, DeepFool and CW, and the corresponding clean samples serve as the training targets.
(2) Determination of parameters.
The threshold used in the wavelet-transform thresholding is λ = 0.02, and the decision threshold used to detect adversarial samples is T = 0.91955.
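With these parameters, the detection decision of step (5) reduces to a single comparison; the sketch below merely ties together the cosine and wavelet_denoise helpers assumed earlier and the experimentally chosen T, and is illustrative only.

```python
LAM = 0.02     # wavelet threshold lambda from the experiment
T = 0.91955    # decision threshold for detecting adversarial samples

def is_adversarial(model, x):
    """Flag x when the cosine similarity C'_i between the model outputs
    before and after wavelet denoising falls below T (step 5)."""
    c = cosine(model(x), model(wavelet_denoise(x, lam=LAM)))
    return c < T
```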
(3) Experimental results.
The invention uses 5 attack algorithms (FGSM, BIM, PGD, DeepFool, CW) to generate 5 kinds of adversarial samples and defends against them with the proposed defense method, using the defense success rate ACC and the false-detection rate FPR as measures of the defense effect, compared against other defense (detection) methods; the experimental results are shown in Table 1.
TABLE 1 defense effects
[Table 1 (defense effects): defense success rate ACC and false-detection rate FPR for the FGSM, BIM, PGD, DeepFool and CW adversarial samples, compared with other defense (detection) methods; the table is reproduced only as an image in the original publication.]
The embodiments described in this specification are merely illustrative of the inventive concept; the scope of the present invention should not be considered limited to the specific forms set forth in the embodiments, but also covers equivalent technical means that those skilled in the art can conceive of on the basis of the inventive concept.

Claims (6)

1. A voiceprint recognition adversarial sample defense method based on cosine similarity and voice denoising, characterized by comprising the following steps:
Step 1: carry out data preprocessing on the speaker voice signals;
Step 2: build a voiceprint recognition model;
Step 3: design adversarial samples according to the voiceprint model;
Step 4: obtain a detection threshold from clean data by a wavelet-transform reconstruction method;
Step 5: detect adversarial samples according to the decision threshold: apply the wavelet transform to a sample X_i^adv, obtain the cosine similarity value C'_i between the model outputs before and after reconstruction, and compare it with the decision threshold T; if C'_i < T, X_i^adv is judged to be an adversarial sample;
Step 6: defend the adversarial samples with the denoising neural network.
2. The voiceprint recognition adversarial sample defense method based on cosine similarity and voice denoising as claimed in claim 1, wherein step 1 specifically comprises:
First, data extraction is performed on the existing voice files (WAV format); the speaker voice data are extracted with the librosa audio-processing Python toolkit:
X_i, sr = librosa.load(T_i, sr=None), i = 1, 2, ..., n+m   (1)
where X_i is the voice data extracted from the i-th speaker voice file, sr is the sampling rate of the voice data, and T_i is the i-th speaker voice file;
the speech signal data set is normalized, and the data set D is divided into a training set D_train and a test set D_test, where
D_train = {(X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)}
D_test = {(X_{n+1}, Y_{n+1}), (X_{n+2}, Y_{n+2}), ..., (X_{n+m}, Y_{n+m})}
X_i = (X_{i1}, X_{i2}, ..., X_{id}), where d denotes the data length of X_i and Y_i ∈ {1, 2, ..., c} denotes the class label among c classes; the normalization formula is:
X̄_i = X_i / max(X_i)   (2)
where X̄_i denotes the normalized sample and max(X_i) denotes the maximum value of the d sampling points in the sample, so that the normalized X̄_i lies in [-1, 1].
3. The voiceprint recognition adversarial sample defense method based on cosine similarity and voice denoising as claimed in claim 1, wherein step 2 specifically comprises:
The structure and parameters of the classification model are specified in advance and are not changed thereafter; the classification model structure adopted by the invention mainly comprises 1D convolution layers, max-pooling layers, batch-normalization layers and fully connected layers; the specific structure is shown in Table 1. Training is performed with the training data set, and the voiceprint recognition classification model is:
target model: Ŷ_i = F_target(X_i)   (3)
where F_target(·) denotes the probability vector output by the model.
4. The voiceprint recognition adversarial sample defense method based on cosine similarity and voice denoising as claimed in claim 1, wherein step 3 specifically comprises:
The adversarial sample is defined as:
X_i^adv = X_i + δ_i   (4)
where δ_i is the perturbation added to the original sample;
the generation of adversarial samples is described taking an optimization-based adversarial attack method as an example; optimization-based attack methods are in essence gradient-based adversarial sample generation methods;
the optimization function is defined as:
min_{δ_i} ||δ_i||   (5)
s.t. F_target(X_i + δ_i) ≠ Y_i   (6)
5. The voiceprint recognition adversarial sample defense method based on cosine similarity and voice denoising as claimed in claim 1, wherein step 4 specifically comprises:
A detection threshold is obtained from the clean data by a wavelet-transform reconstruction method;
the perturbation e_i(t) is suppressed to recover the true signal f_i(t): after the wavelet transform, the correlation within the signal f_i(t) is removed to the greatest possible extent and most of its energy is concentrated on a few wavelet coefficients of large amplitude, whereas the perturbation e_i(t) is spread over the whole time axis at every scale after the wavelet transform and its amplitudes are not large; the voice signal can therefore be denoised by threshold filtering;
the wavelet threshold denoising steps are as follows:
I. Perform the wavelet transform on the speech signal f_i(t): select an orthogonal wavelet and the number of decomposition levels N, and perform an N-level wavelet decomposition of f_i(t);
II. Apply threshold processing to the wavelet transform coefficients of the sample signal f_i(t): the high-frequency coefficients of each level, from the first level to the N-th level, are processed through a threshold function, while the low-frequency coefficients of each level are left unprocessed; the threshold formula (soft thresholding) is:
ŵ = sgn(w)(|w| - λ) if |w| ≥ λ, and ŵ = 0 if |w| < λ   (7)
where w is a wavelet coefficient and λ is the selected threshold;
III. Reconstruct the processed wavelet coefficients: the voice signal is reconstructed from the low-frequency coefficients of the N-th decomposition level and the processed high-frequency coefficients of levels 1 to N, giving the denoised voice signal f_i^de(t), with
f_i^de(t) ≈ f_i(t)   (8)
The original clean sample X_i and the clean sample after the wavelet transform X_i^de are input into the voiceprint recognition model to obtain the corresponding output probability vectors F_target(X_i) and F_target(X_i^de); the cosine similarity value between the two vectors is then calculated as
C_i = (F_target(X_i) · F_target(X_i^de)) / (||F_target(X_i)|| · ||F_target(X_i^de)||)   (9)
A decision threshold is then selected from the set C of cosine similarity values of the clean samples: if the false-detection rate is set to b%, the decision threshold T is the value at the b-th percentile of sort(C), where sort(C) denotes C sorted in ascending order.
6. The voiceprint recognition adversarial sample defense method based on cosine similarity and voice denoising as claimed in claim 1, wherein step 6 specifically comprises:
First, the denoising neural network is trained. The denoising neural network adopted is DCCRN. DCCRN first transforms the noisy speech by the short-time Fourier transform to obtain a complex speech spectrum with a real part and an imaginary part. Define the input complex matrix I = I_r + jI_i and the complex convolution filter W = W_r + jW_i, where the matrices W_r and W_i represent the real and imaginary parts of the complex convolution kernel; the output feature F_out of the complex layer is:
F_out = (I_r * W_r - I_i * W_i) + j(I_r * W_i + I_i * W_r)   (10)
Similarly to the complex convolution, given the real and imaginary parts X_r and X_i of a complex input, the output L_out of the complex LSTM can be defined as:
L_rr = LSTM_r(X_r); L_ii = LSTM_i(X_i)   (11)
L_ri = LSTM_i(X_r); L_ir = LSTM_r(X_i)   (12)
L_out = (L_rr - L_ii) + j(L_ri + L_ir)   (13)
where LSTM_r and LSTM_i are two conventional LSTM modules for the real and imaginary parts;
during network training, the purpose of DCCRN is to optimize a complex mask matrix M = M_r + jM_i; applying the mask matrix to the noisy speech gives the denoised speech:
S = M * Y   (14)
where S = S_r + jS_i is the complex spectrum of clean speech and Y = Y_r + jY_i is the complex spectrum of noisy speech;
given S and Y, M can be calculated by the following formula:
M = S / Y = (S_r + jS_i) / (Y_r + jY_i) = ((S_r·Y_r + S_i·Y_i) + j(S_i·Y_r - S_r·Y_i)) / (Y_r² + Y_i²)   (15)
M is expressed in polar coordinates as:
M = M_mag · e^(j·M_phase)   (16)
M_mag = √(M_r² + M_i²)   (17)
M_phase = arctan2(M_i, M_r)   (18)
with this, the enhanced speech can be estimated from the noisy speech as:
S̃ = M_mag · Y_mag · e^(j·(M_phase + Y_phase))   (19)
The loss function of DCCRN is SI-SNR, calculated as follows:
s_target = (⟨s̃, s⟩ · s) / ||s||²   (20)
e_noise = s̃ - s_target   (21)
SI-SNR = 10 · log₁₀(||s_target||² / ||e_noise||²)   (22)
where ⟨·,·⟩ denotes the dot product between two vectors; when the estimated speech s̃ is very close to the clean speech s, e_noise ≈ 0;
after the DCCRN denoising network is trained, the undetected adversarial samples are input into the denoising network to obtain denoised voice, and the denoised voice is restored to its original correct label.
CN202210653117.4A 2022-06-09 2022-06-09 Voiceprint recognition countermeasure sample defense method based on cosine similarity and voice denoising Pending CN115188384A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210653117.4A CN115188384A (en) 2022-06-09 2022-06-09 Voiceprint recognition countermeasure sample defense method based on cosine similarity and voice denoising

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210653117.4A CN115188384A (en) 2022-06-09 2022-06-09 Voiceprint recognition countermeasure sample defense method based on cosine similarity and voice denoising

Publications (1)

Publication Number Publication Date
CN115188384A true CN115188384A (en) 2022-10-14

Family

ID=83513537

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210653117.4A Pending CN115188384A (en) 2022-06-09 2022-06-09 Voiceprint recognition countermeasure sample defense method based on cosine similarity and voice denoising

Country Status (1)

Country Link
CN (1) CN115188384A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117012204A (en) * 2023-07-25 2023-11-07 贵州师范大学 Defensive method for countermeasure sample of speaker recognition system
CN117316187A (en) * 2023-11-30 2023-12-29 山东同其万疆科技创新有限公司 English teaching management system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111564154A (en) * 2020-03-23 2020-08-21 北京邮电大学 Method and device for defending against sample attack based on voice enhancement algorithm
CN113111945A (en) * 2021-04-15 2021-07-13 东南大学 Confrontation sample defense method based on transform self-encoder
CN113378643A (en) * 2021-05-14 2021-09-10 浙江工业大学 Signal countermeasure sample detection method based on random transformation and wavelet reconstruction
WO2021205746A1 (en) * 2020-04-09 2021-10-14 Mitsubishi Electric Corporation System and method for detecting adversarial attacks
CN114511018A (en) * 2022-01-24 2022-05-17 中国人民解放军国防科技大学 Countermeasure sample detection method and device based on intra-class adjustment cosine similarity

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111564154A (en) * 2020-03-23 2020-08-21 北京邮电大学 Method and device for defending against sample attack based on voice enhancement algorithm
WO2021205746A1 (en) * 2020-04-09 2021-10-14 Mitsubishi Electric Corporation System and method for detecting adversarial attacks
CN113111945A (en) * 2021-04-15 2021-07-13 东南大学 Confrontation sample defense method based on transform self-encoder
CN113378643A (en) * 2021-05-14 2021-09-10 浙江工业大学 Signal countermeasure sample detection method based on random transformation and wavelet reconstruction
CN114511018A (en) * 2022-01-24 2022-05-17 中国人民解放军国防科技大学 Countermeasure sample detection method and device based on intra-class adjustment cosine similarity

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XU, Dongwei et al.: "A Survey of Speech Adversarial Attack and Defense Methods" (语音对抗攻击与防御方法综述), Journal of Cyber Security (信息安全学报), vol. 7, no. 1, 31 January 2022 (2022-01-31), pages 126-144 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117012204A (en) * 2023-07-25 2023-11-07 贵州师范大学 Defensive method for countermeasure sample of speaker recognition system
CN117012204B (en) * 2023-07-25 2024-04-09 贵州师范大学 Defensive method for countermeasure sample of speaker recognition system
CN117316187A (en) * 2023-11-30 2023-12-29 山东同其万疆科技创新有限公司 English teaching management system
CN117316187B (en) * 2023-11-30 2024-02-06 山东同其万疆科技创新有限公司 English teaching management system

Similar Documents

Publication Publication Date Title
Novoselov et al. STC anti-spoofing systems for the ASVspoof 2015 challenge
Fallah et al. A new online signature verification system based on combining Mellin transform, MFCC and neural network
Hidayat et al. Denoising speech for MFCC feature extraction using wavelet transformation in speech recognition system
Dennis et al. Temporal coding of local spectrogram features for robust sound recognition
Rajaratnam et al. Noise flooding for detecting audio adversarial examples against automatic speech recognition
CN109872720B (en) Re-recorded voice detection algorithm for different scene robustness based on convolutional neural network
WO2006024117A1 (en) Method for automatic speaker recognition
Rajaratnam et al. Isolated and ensemble audio preprocessing methods for detecting adversarial examples against automatic speech recognition
CN115188384A (en) Voiceprint recognition countermeasure sample defense method based on cosine similarity and voice denoising
Wu et al. Defense for black-box attacks on anti-spoofing models by self-supervised learning
Sun et al. Ai-synthesized voice detection using neural vocoder artifacts
CN111312259B (en) Voiceprint recognition method, system, mobile terminal and storage medium
CN116416997A (en) Intelligent voice fake attack detection method based on attention mechanism
Kim et al. Multifeature fusion-based earthquake event classification using transfer learning
Ahmad et al. Automatic detection of tree cutting in forests using acoustic properties
Wu et al. Adversarial sample detection for speaker verification by neural vocoders
Białobrzeski et al. Robust Bayesian and light neural networks for voice spoofing detection
CN114640518B (en) Personalized trigger back door attack method based on audio steganography
Maciejewski et al. Neural networks for vehicle recognition
Alegre et al. Evasion and obfuscation in automatic speaker verification
Raj et al. Reconstruction of damaged spectrographic features for robust speech recognition.
Sailor et al. Unsupervised Representation Learning Using Convolutional Restricted Boltzmann Machine for Spoof Speech Detection.
CN116229992A (en) Voice lie detection method, device, medium and equipment
Gala et al. Evaluating the effectiveness of attacks and defenses on machine learning through adversarial samples
Yu et al. A multi-spike approach for robust sound recognition

Legal Events

Date Code Title Description
PB01 - Publication
SE01 - Entry into force of request for substantive examination