CN115188384A - Voiceprint recognition countermeasure sample defense method based on cosine similarity and voice denoising - Google Patents
- Publication number
- CN115188384A (application CN202210653117.4A)
- Authority
- CN
- China
- Prior art keywords
- sample
- voice
- cosine similarity
- denoising
- threshold
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications (all under G10L17/00 — Speaker identification or verification techniques)
- G10L17/04 — Training, enrolment or model building
- G10L17/06 — Decision making techniques; Pattern matching strategies
- G10L17/18 — Artificial neural networks; Connectionist approaches
- G10L17/20 — Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
- G10L17/22 — Interactive procedures; Man-machine interfaces
Abstract
A voiceprint-recognition adversarial-sample defense method based on cosine similarity and speech denoising. Speaker voice data are first preprocessed; a voiceprint recognition model is built; an adversarial-sample generator carrying malicious information is designed by combining the voiceprint recognition model with several different adversarial attack methods. Clean data are reconstructed by wavelet transform, the cosine similarity between the output probability vectors produced by the classification model before and after the wavelet transform is computed, and a cosine similarity threshold is set. Each adversarial sample is likewise reconstructed by wavelet transform, and the cosine similarity between its output vectors before and after reconstruction is computed and compared with the threshold: samples whose similarity falls below the threshold are judged adversarial, while those above it go undetected. Finally, a speech-denoising neural network is trained, and the undetected adversarial samples are passed through it to remove the adversarial perturbation.
Description
Technical Field
The invention relates to a method for defending voiceprint recognition systems against adversarial samples, and belongs to the field of deep learning security.
Background
With its rapid development, deep learning has become one of the most widely used technologies in artificial intelligence, influencing and changing many aspects of daily life; typical applications include smart homes, intelligent driving, speech recognition, and voiceprint recognition. As a highly complex software system, a deep learning system can itself face various attacks, through which hackers can threaten property, personal privacy, traffic safety, and public security. Attacks against deep learning systems typically include the following. 1. Model stealing: a hacker steals the model files deployed on a server by various advanced means. 2. Data poisoning: abnormal data are injected into the training samples so that the model misclassifies under certain conditions; for example, a backdoor attack adds a backdoor trigger to the poisoned data, thereby poisoning the model. 3. Adversarial samples: input samples formed by deliberately adding subtle perturbations to the data, causing the model to give an erroneous output with high confidence. In short, an adversarial sample makes a deep learning model misclassify by superimposing a carefully constructed, humanly imperceptible perturbation on the original data. The security of deep learning has become a problem demanding urgent solution.
Defense methods fall mainly into two categories: adversarial-sample defense and adversarial-sample detection. The main purpose of adversarial-sample defense is to restore the classification label of an adversarial sample to that of the normal sample; the main purpose of adversarial-sample detection is to find the adversarial samples in a sample set and reject them.
Disclosure of Invention
The invention provides a voiceprint-recognition adversarial-sample defense method based on cosine similarity and speech denoising, which aims to overcome the defects of the prior art.
The technical solution adopted by the invention is as follows: preprocess the speaker voice data used; build a voiceprint recognition model; design an adversarial-sample generator carrying malicious information by combining the voiceprint recognition model with several different adversarial attack methods; reconstruct clean data by wavelet transform, compute the cosine similarity between the output probability vectors produced by the classification model before and after the wavelet transform, and set a cosine similarity threshold; reconstruct each adversarial sample by wavelet transform, compute the cosine similarity between its output vectors before and after reconstruction, and compare it with the threshold, samples below the threshold being judged adversarial and those above it going undetected; train a speech-denoising neural network and pass the undetected adversarial samples through it to remove the adversarial perturbation.
A voiceprint-recognition adversarial-sample defense method based on cosine similarity and speech denoising comprises the following steps:
Step 1: preprocess the speaker voice signal;
Step 2: build a voiceprint recognition model;
Step 3: design adversarial samples against the voiceprint model;
Step 4: obtain a detection threshold from clean data by wavelet-transform reconstruction;
Step 5: detect adversarial samples using the decision threshold: apply the wavelet transform to an adversarial sample X'_i, compute its cosine similarity value C'_i, and compare it with the decision threshold T; if C'_i < T, the sample X'_i is judged adversarial;
Step 6: defend the adversarial samples with a denoising neural network.
Further, step 1 specifically comprises:
Firstly, data extraction is carried out on the existing voice files (WAV format); the speaker voice data are extracted with the librosa audio-processing Python toolkit:

X_i, sr = librosa.load(T_i, sr=None), i = 1, 2, ..., n+m    (1)

where X_i is the voice data extracted from the i-th speaker voice file, sr is the sampling rate of the voice data, and T_i is the i-th speaker voice file.
The speech data set is then normalized, and the data set D is divided into a training set D_train and a test set D_test, where

D_train = {(X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)}
D_test = {(X_{n+1}, Y_{n+1}), (X_{n+2}, Y_{n+2}), ..., (X_{n+m}, Y_{n+m})}

and X_i = (X_i1, X_i2, ..., X_id), where d is the data length of X_i and Y_i ∈ {1, 2, ..., c} is the class label. The normalization formula is:

X̂_i = X_i / max(X_i)

where X̂_i denotes the normalized sample and max(X_i) is the maximum value over the d sampling points of the sample.
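The preprocessing above (peak normalization and the train/test split) can be sketched in plain Python. The function names are illustrative, and loading the WAV data itself would use `librosa.load` as in formula (1):

```python
def normalize(x):
    """Scale a waveform by its peak absolute value so samples lie in [-1, 1]."""
    peak = max(abs(v) for v in x)
    return [v / peak for v in x] if peak > 0 else list(x)

def split_dataset(samples, labels, n_train):
    """Pair waveforms X_i with labels Y_i, then split into D_train (the first
    n_train pairs) and D_test (the remaining m pairs)."""
    pairs = list(zip(samples, labels))
    return pairs[:n_train], pairs[n_train:]
```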
Further, step 2 specifically includes:
The structure and parameters of the classification model are specified in advance and are not changed thereafter. The classification model adopted by the invention mainly comprises 1D convolution layers, max-pooling layers, batch-normalization layers, and fully connected layers; the specific structure is shown in Table 1. The model is trained on the training data set, giving the voiceprint recognition classification model

target model: F_target(·)

where F_target(·) denotes the probability vector output by the model.
Further, step 3 specifically comprises:
The adversarial sample is defined as:

X'_i = X_i + δ_i

where δ_i is a perturbation added to the original sample X_i.
The generation of adversarial samples is illustrated here with an optimization-based adversarial attack; optimization-based attacks are in essence gradient-based adversarial-sample generation methods. The optimization function is defined as:

min_{δ_i} ‖δ_i‖ subject to F_target(X_i + δ_i) ≠ Y_i
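As a toy illustration of gradient-based perturbation (in the spirit of FGSM, one of the attacks named later in this patent, rather than the patent's own generator), the sketch below takes one gradient-sign step against an arbitrary loss; the numeric central-difference gradient and all names are assumptions for the example:

```python
def numeric_grad(loss, x, h=1e-6):
    """Central-difference gradient of a scalar loss over a list of inputs."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((loss(xp) - loss(xm)) / (2 * h))
    return g

def fgsm_perturb(x, loss, eps):
    """x' = x + eps * sign(dL/dx): one gradient-sign step that increases the loss."""
    g = numeric_grad(loss, x)
    return [xi + eps * (1 if gi > 0 else -1 if gi < 0 else 0)
            for xi, gi in zip(x, g)]
```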
Further, step 4 specifically comprises:
A detection threshold is obtained from the clean data by wavelet-transform reconstruction.
To suppress the perturbation e_i(t) and recover the true signal f_i(t), note that after a wavelet transform the correlation within the signal f_i(t) is removed to the greatest extent, so that most of its energy concentrates on a few wavelet coefficients of large amplitude, whereas the perturbation e_i(t) is spread after the wavelet transform over the whole time axis at every scale, with no very large amplitudes. The speech signal can therefore be denoised by threshold filtering.
The wavelet threshold denoising steps are as follows:
I. Apply the wavelet transform to the speech signal f_i(t): select an orthogonal wavelet and the number of decomposition levels N, and perform an N-level wavelet decomposition of f_i(t).
II. Apply threshold processing to the wavelet coefficients of the sample signal f_i(t): the high-frequency (detail) coefficients of levels 1 through N are processed with a threshold function, while the low-frequency coefficients of each level are left unchanged. The threshold formula is:

w_λ = sign(w) · (|w| − λ) if |w| ≥ λ, and w_λ = 0 if |w| < λ

where w is a wavelet coefficient and λ is the selected threshold.
III. Reconstruct the processed wavelet coefficients: the speech signal is rebuilt from the level-N low-frequency coefficients and the processed high-frequency coefficients of levels 1 through N, yielding the denoised speech signal f_i^de(t).
The original clean sample X̂_i and the wavelet-transformed clean sample X̂_i^de are fed into the voiceprint recognition model to obtain the corresponding output probability vectors F_target(X̂_i) and F_target(X̂_i^de); the cosine similarity between the two vectors is then

C_i = ⟨F_target(X̂_i), F_target(X̂_i^de)⟩ / (‖F_target(X̂_i)‖ · ‖F_target(X̂_i^de)‖)

A decision threshold is then selected from the set C of clean-sample cosine similarity values: if the false-detection rate is set to b%, the decision threshold T is the value at the b-th percentile of sort(C), where sort(C) sorts C in ascending order.
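The coefficient shrinkage, the cosine-similarity measure, and the percentile-based threshold selection described above can be sketched as follows. This is a plain-Python illustration; the soft-thresholding variant and the exact percentile convention are assumptions, not taken verbatim from the patent:

```python
import math

def soft_threshold(coeffs, lam):
    """Shrink wavelet detail coefficients toward zero by lam; zero the small ones."""
    return [math.copysign(abs(w) - lam, w) if abs(w) >= lam else 0.0
            for w in coeffs]

def cosine_similarity(u, v):
    """Cosine similarity between two model output probability vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def select_threshold(clean_sims, b_percent):
    """Decision threshold T: the value at the b-th percentile of the ascending-sorted
    clean-sample similarities, so roughly b% of clean samples are falsely flagged."""
    s = sorted(clean_sims)
    idx = max(0, min(len(s) - 1, int(round(b_percent / 100.0 * len(s))) - 1))
    return s[idx]
```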
Further, step 6 specifically comprises:
Firstly, the denoising neural network is trained. The denoising network adopted is DCCRN. DCCRN first applies a short-time Fourier transform to the noisy speech to obtain a complex speech spectrum with real and imaginary parts, and then defines an input complex matrix I = I_r + jI_i and a complex convolution filter W = W_r + jW_i, where the matrices W_r and W_i represent the real and imaginary parts of the complex convolution kernel. The output feature F_out of the complex layer is:

F_out = (I_r * W_r − I_i * W_i) + j(I_r * W_i + I_i * W_r)    (10)

Analogously to the complex convolution, given the real and imaginary parts X_r and X_i of a complex input, the output L_out of the complex LSTM can be defined as:

L_rr = LSTM_r(X_r);  L_ii = LSTM_i(X_i)    (11)
L_ri = LSTM_i(X_r);  L_ir = LSTM_r(X_i)    (12)
L_out = (L_rr − L_ii) + j(L_ri + L_ir)    (13)

where LSTM_r and LSTM_i are two conventional LSTM modules for the real and imaginary parts.
During training, the goal of DCCRN is to optimize a complex mask matrix M = M_r + jM_i; applying the mask to the noisy speech yields the denoised speech:

S = M * Y    (14)

where S = S_r + jS_i is the complex spectrum of the clean speech and Y = Y_r + jY_i is the complex spectrum of the noisy speech.
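A minimal numeric illustration of the complex arithmetic above: an elementwise complex product stands in for DCCRN's complex convolution of formula (10), and the complex ratio mask M = S / Y stands in for formula (14) inverted. All names are illustrative, and real-valued convolutions are reduced to scalar products here:

```python
def complex_layer_out(I, W):
    """F_out = (Ir*Wr - Ii*Wi) + j(Ir*Wi + Ii*Wr), applied per element.
    The * of the patent's formula (a convolution) is reduced to a product."""
    return [(i.real * w.real - i.imag * w.imag)
            + 1j * (i.real * w.imag + i.imag * w.real)
            for i, w in zip(I, W)]

def complex_mask(S, Y, eps=1e-12):
    """Complex ratio mask M = S / Y per time-frequency bin, guarded against
    division by (near-)zero noisy-spectrum bins."""
    return [s / (y if abs(y) > eps else eps) for s, y in zip(S, Y)]
```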
Given S and Y, M can be calculated by the following formulas:

M_r = (S_r Y_r + S_i Y_i) / (Y_r² + Y_i²)    (15)
M_i = (S_i Y_r − S_r Y_i) / (Y_r² + Y_i²)    (16)

Expressed in polar coordinates:

M_mag = sqrt(M_r² + M_i²)    (17)
M_phase = arctan2(M_i, M_r)    (18)

With this, the clean speech can be estimated from the noisy speech as:

S̃ = M_mag · Y_mag · e^{j(Y_phase + M_phase)}

where Y_mag and Y_phase are the magnitude and phase of the noisy spectrum Y.
The loss function of DCCRN is SI-SNR, calculated as:

s_target = (⟨ŝ, s⟩ · s) / ‖s‖²
e_n = ŝ − s_target
SI-SNR = 10 · log10(‖s_target‖² / ‖e_n‖²)

where ⟨·,·⟩ denotes the dot product between two vectors; when the estimated speech ŝ is very close to the clean speech s, e_n ≈ 0.
After the DCCRN denoising network is trained, the undetected adversarial samples are fed into it to obtain the denoised speech, which is restored to the original correct label.
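The SI-SNR loss can be sketched directly from its definition (s_target, e_n, and the 10·log10 ratio). This plain-Python version is an illustration over waveform lists, not the DCCRN training code:

```python
import math

def si_snr(est, ref, eps=1e-12):
    """Scale-invariant SNR in dB between an estimated and a reference waveform."""
    dot = sum(a * b for a, b in zip(est, ref))        # <s_hat, s>
    ref_energy = sum(b * b for b in ref) + eps        # ||s||^2
    target = [dot / ref_energy * b for b in ref]      # s_target
    noise = [a - t for a, t in zip(est, target)]      # e_n = s_hat - s_target
    t_pow = sum(t * t for t in target)
    n_pow = sum(n * n for n in noise) + eps
    return 10.0 * math.log10(t_pow / n_pow + eps)
```

Because s_target is a projection onto the reference, rescaling the estimate leaves the score unchanged, which is the "scale-invariant" part of the name.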
The working principle of the invention is as follows:
Data preprocessing of the speech data set: the raw-waveform time-domain data of each speech segment are acquired, divided into a training set and a test set, and normalized.
Building the voiceprint recognition model: the structure and parameters of the voiceprint recognition model are specified in advance and are not changed thereafter. A data set suited to the recognition model is also given, namely speaker voice samples comprising the input time-domain waveform data used for speaker recognition and the corresponding classification labels; the model can predict the samples in this data set with high accuracy.
Designing adversarial samples against the voiceprint recognition model: several commonly used white-box adversarial attack methods are chosen. Guided by the parameters of the voiceprint recognition model, the input data are adjusted along the gradient direction so that the model produces a wrong label after only a slight change to the input sample.
Obtaining a detection threshold from clean data by wavelet-transform reconstruction: a clean sample is wavelet-decomposed and then reconstructed, achieving denoising by wavelet transform; the clean samples before and after the wavelet transform are fed into the voiceprint recognition network to obtain two output probability vectors, the cosine similarity between the corresponding vectors is computed, and a threshold is chosen among these cosine similarity values with the false-detection rate as the criterion.
Detecting adversarial samples with the decision threshold: the adversarial sample is wavelet-transformed, the samples before and after the transform are fed into the voiceprint recognition model to obtain output probability vectors, the cosine similarity is computed and compared with the decision threshold, and a sample is judged adversarial if its similarity is below the threshold.
Defending adversarial samples with the denoising neural network: a denoising neural network is first trained, and the adversarial samples that escaped detection in the previous step are passed through it; the resulting denoised samples largely lose their adversarial character, achieving further defense.
An adversarial sample differs from a normal sample in carrying more noise, so denoising changes it more than it changes a clean sample. Reflected in the model, the output probability of an adversarial sample changes more before and after denoising. This change can be measured by the cosine similarity between the output vectors: a small change in the output vector gives a large cosine similarity value, and a large change gives a small one. Adversarial samples can be detected using this property. For further defense, adversarial samples that escape the detection step are additionally denoised, improving the defense effect. The invention thus enhances the security of the voiceprint recognition model.
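The detect-then-purify pipeline of the working principle can be sketched as follows, with `model`, `wavelet_denoise`, and `network_denoise` as illustrative stand-ins for the voiceprint model, the wavelet reconstruction, and the DCCRN denoiser:

```python
import math

def _cos(u, v):
    """Cosine similarity between two output probability vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def defend(samples, model, wavelet_denoise, network_denoise, T):
    """Flag a sample as adversarial when the cosine similarity of the model's
    outputs before/after wavelet denoising falls below T; pass the rest
    through the denoising network as a second line of defense."""
    detected, passed = [], []
    for x in samples:
        if _cos(model(x), model(wavelet_denoise(x))) < T:
            detected.append(x)
        else:
            passed.append(network_denoise(x))
    return detected, passed
```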
Drawings
FIG. 1 is a basic flow diagram of the process of the present invention.
FIG. 2 is a flow chart of the wavelet transform denoising of the present invention.
Fig. 3 is a diagram illustrating a structure of a DCCRN network according to the present invention.
Detailed Description of the Embodiments:
The technical solution of the invention is further explained below with reference to the drawings.
Example 1
A voiceprint-recognition adversarial-sample defense method based on cosine similarity and speech denoising comprises the following steps:
(1) Data preprocessing of the speaker voice signal:
First, data extraction is performed on the existing voice files (WAV format); the speaker voice data are extracted with the librosa audio-processing Python toolkit:

X_i, sr = librosa.load(T_i, sr=None), i = 1, 2, ..., n+m    (1)

where X_i is the voice data extracted from the i-th speaker voice file, sr is the sampling rate of the voice data, and T_i is the i-th speaker voice file.
The speech data set is normalized and the data set D is divided into a training set D_train and a test set D_test, where

D_train = {(X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)}
D_test = {(X_{n+1}, Y_{n+1}), (X_{n+2}, Y_{n+2}), ..., (X_{n+m}, Y_{n+m})}

and X_i = (X_i1, X_i2, ..., X_id), where d is the data length of X_i and Y_i ∈ {1, 2, ..., c} is the class label. The normalization formula is:

X̂_i = X_i / max(X_i)

where X̂_i denotes the normalized sample and max(X_i) is the maximum value over the d sampling points of the sample.
(2) Building the voiceprint recognition model: the structure and parameters of the classification model are specified in advance and are not changed thereafter. The classification model adopted by the invention mainly comprises 1D convolution layers, max-pooling layers, batch-normalization layers, and fully connected layers; the specific structure is shown in Table 1. The model is trained on the training data set, giving the voiceprint recognition classification model

target model: F_target(·)

where F_target(·) denotes the probability vector output by the model.
(3) Designing adversarial samples against the voiceprint recognition classifier:
An adversarial sample is defined as:

X'_i = X_i + δ_i

where δ_i is a perturbation added to the original sample X_i.
The generation of adversarial samples is illustrated here with an optimization-based adversarial attack; optimization-based attacks are in essence gradient-based adversarial-sample generation methods. The optimization function is defined as:

min_{δ_i} ‖δ_i‖ subject to F_target(X_i + δ_i) ≠ Y_i
(4) Computing cosine similarities of clean samples before and after the wavelet transform and selecting the decision threshold:
To suppress the perturbation e_i(t) and recover the true signal f_i(t), note that after a wavelet transform the correlation within the signal f_i(t) is removed to the greatest extent, so that most of its energy concentrates on a few wavelet coefficients of large amplitude, whereas the perturbation e_i(t) is spread over the whole time axis at every scale, with no very large amplitudes. The speech signal can therefore be denoised by threshold filtering.
The wavelet threshold denoising steps are as follows:
I. Apply the wavelet transform to the speech signal f_i(t): select an orthogonal wavelet and the number of decomposition levels N, and perform an N-level wavelet decomposition of f_i(t).
II. Apply threshold processing to the wavelet coefficients of the sample signal f_i(t): the high-frequency (detail) coefficients of levels 1 through N are processed with a threshold function, while the low-frequency coefficients of each level are left unchanged. The threshold formula is:

w_λ = sign(w) · (|w| − λ) if |w| ≥ λ, and w_λ = 0 if |w| < λ

where w is a wavelet coefficient and λ is the selected threshold.
III. Reconstruct the processed wavelet coefficients: the speech signal is rebuilt from the level-N low-frequency coefficients and the processed high-frequency coefficients of levels 1 through N, yielding the denoised speech signal f_i^de(t).
The original clean sample X̂_i and the wavelet-transformed clean sample X̂_i^de are fed into the voiceprint recognition model to obtain the corresponding output probability vectors F_target(X̂_i) and F_target(X̂_i^de); the cosine similarity between the two vectors is then

C_i = ⟨F_target(X̂_i), F_target(X̂_i^de)⟩ / (‖F_target(X̂_i)‖ · ‖F_target(X̂_i^de)‖)

A decision threshold is then selected from the set C of clean-sample cosine similarity values: if the false-detection rate is set to b%, the decision threshold T is the value at the b-th percentile of sort(C), where sort(C) sorts C in ascending order.
(5) Detecting adversarial samples with the decision threshold: apply the wavelet transform to an adversarial sample X'_i, obtain its cosine similarity value C'_i, and compare it with the decision threshold T; if C'_i < T, the sample X'_i is judged adversarial.
(6) Denoising the undetected adversarial samples with the denoising neural network:
First, the denoising neural network is trained. The denoising network adopted by the invention is DCCRN. DCCRN first applies a short-time Fourier transform to the noisy speech to obtain a complex speech spectrum with real and imaginary parts, and then defines an input complex matrix I = I_r + jI_i and a complex convolution filter W = W_r + jW_i, where the matrices W_r and W_i represent the real and imaginary parts of the complex convolution kernel. The output feature F_out of the complex layer is:

F_out = (I_r * W_r − I_i * W_i) + j(I_r * W_i + I_i * W_r)    (10)

Analogously to the complex convolution, given the real and imaginary parts X_r and X_i of a complex input, the output L_out of the complex LSTM can be defined as:

L_rr = LSTM_r(X_r);  L_ii = LSTM_i(X_i)    (11)
L_ri = LSTM_i(X_r);  L_ir = LSTM_r(X_i)    (12)
L_out = (L_rr − L_ii) + j(L_ri + L_ir)    (13)

where LSTM_r and LSTM_i are two conventional LSTM modules for the real and imaginary parts.
During training, the goal of DCCRN is to optimize a complex mask matrix M = M_r + jM_i; applying the mask to the noisy speech yields the denoised speech:

S = M * Y    (14)

where S = S_r + jS_i is the complex spectrum of the clean speech and Y = Y_r + jY_i is the complex spectrum of the noisy speech.
Given S and Y, M can be calculated by the following formulas:

M_r = (S_r Y_r + S_i Y_i) / (Y_r² + Y_i²)    (15)
M_i = (S_i Y_r − S_r Y_i) / (Y_r² + Y_i²)    (16)

Expressed in polar coordinates:

M_mag = sqrt(M_r² + M_i²)    (17)
M_phase = arctan2(M_i, M_r)    (18)

With this, the clean speech can be estimated from the noisy speech as:

S̃ = M_mag · Y_mag · e^{j(Y_phase + M_phase)}

where Y_mag and Y_phase are the magnitude and phase of the noisy spectrum Y.
The loss function of DCCRN is SI-SNR, calculated as:

s_target = (⟨ŝ, s⟩ · s) / ‖s‖²
e_n = ŝ − s_target
SI-SNR = 10 · log10(‖s_target‖² / ‖e_n‖²)

where ⟨·,·⟩ denotes the dot product between two vectors; when the estimated speech ŝ is very close to the clean speech s, e_n ≈ 0.
After the DCCRN denoising network is trained, the undetected adversarial samples are fed into it to obtain the denoised speech, which is restored to the original correct label.
Example 2: data in actual experiments
(1) Selection of experimental data.
The data set used in the experiment is the AISHELL-1 speech data set, which contains speech recorded in a quiet environment by speakers of different ages, genders, and regions, at a sampling rate of 16000 Hz. Twenty speakers' voices were selected as the data set for the voiceprint recognition model, and the raw-waveform time-domain data length extracted for each utterance is 60000. The data are preprocessed and stored as arrays of shape (batch_size, 60000, 1), the corresponding label data are generated, and the processed data sets are saved as .npy files. The input data of the denoising network DCCRN are the adversarial samples generated by FGSM, BIM, PGD, DeepFool, and CW, and the corresponding clean samples are the output targets.
(2) Determination of parameters.
The threshold used in the wavelet-transform thresholding is λ = 0.02, and the decision threshold for detecting adversarial samples is T = 0.91955.
(3) Experimental results.
The invention uses five attack algorithms (FGSM, BIM, PGD, DeepFool, CW) to generate five kinds of adversarial samples and defends against them with the proposed method; the defense success rate ACC and the false-detection rate FPR are used to measure the defense effect, compared with other defense (detection) methods. The experimental results are shown in Table 1.
TABLE 1 defense effects
The embodiments described in this specification merely illustrate implementations of the inventive concept; the scope of the invention should not be regarded as limited to the specific forms set forth in the embodiments, but also covers equivalents that may occur to those skilled in the art on the basis of the inventive concept.
Claims (6)
1. A voiceprint-recognition adversarial-sample defense method based on cosine similarity and speech denoising, characterized by comprising the following steps:
step 1: preprocessing the speaker voice signal;
step 2: building a voiceprint recognition model;
step 3: designing adversarial samples against the voiceprint model;
step 4: obtaining a detection threshold from clean data by wavelet-transform reconstruction;
step 5: detecting adversarial samples using the decision threshold: applying the wavelet transform to an adversarial sample X'_i, obtaining its cosine similarity value C'_i, and comparing it with the decision threshold T; if C'_i < T, the sample X'_i is judged adversarial;
step 6: defending adversarial samples with a denoising neural network.
2. The method for defending a voiceprint recognition countermeasure sample based on cosine similarity and speech denoising as claimed in claim 1, wherein the step 1 specifically comprises:
firstly, data extraction is carried out on an existing voice file (WAV format), and voice data extraction is carried out on speaker voice by utilizing a library audio processing python toolkit, wherein the steps are as follows:
X i ,sr=librosa(T i ,sr=None),i=1,2,...n+m (1)
wherein X i Is the voice data of the ith speaker voice file extracted, sr is the sampling rate of the voice data, T i Is the ith speaker voice file;
the speech data set is normalized and the data set D is divided into a training set D_train and a test set D_test, where
D_train = {(X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)}
D_test = {(X_{n+1}, Y_{n+1}), (X_{n+2}, Y_{n+2}), ..., (X_{n+m}, Y_{n+m})}
X_i = (X_{i1}, X_{i2}, ..., X_{id}), d denotes the data length of X_i, and Y_i ∈ {1, 2, ..., c} denotes the class label; the data are normalized before use.
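A minimal sketch of the preprocessing in this claim, assuming a simple min-max normalization (the patent text does not reproduce its exact normalization formula) and the first-n/last-m split; all names here are hypothetical:

```python
def min_max_normalize(x):
    # Assumed normalization choice: rescale each sample to [0, 1].
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) for v in x] if hi > lo else [0.0] * len(x)

def split_dataset(samples, labels, n_train):
    # D_train gets the first n (X_i, Y_i) pairs, D_test the remaining m.
    pairs = list(zip(samples, labels))
    return pairs[:n_train], pairs[n_train:]

X = [[-2.0, 0.0, 2.0], [1.0, 2.0, 3.0], [0.0, 5.0, 10.0]]
Y = [0, 1, 0]
X_norm = [min_max_normalize(x) for x in X]
D_train, D_test = split_dataset(X_norm, Y, n_train=2)
```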
3. The voiceprint recognition adversarial sample defense method based on cosine similarity and speech denoising according to claim 1, wherein step 2 specifically comprises:
the structure and parameters of the classification model are specified in advance and do not change; the classification model mainly consists of 1D convolution layers, max-pooling layers, batch-normalization layers, and fully connected layers, with the specific structure shown in Table 1; the model is trained on the training data set, giving the voiceprint recognition classification model:
target model F_target(·), where F_target(·) denotes the probability vector output by the model;
4. The voiceprint recognition adversarial sample defense method based on cosine similarity and speech denoising according to claim 1, wherein step 3 specifically comprises:
the adversarial sample is defined as:
X'_i = X_i + δ_i
where δ_i is the perturbation added to the original sample X_i;
the generation of adversarial samples is illustrated by taking an optimization-based attack method as an example; the optimization-based attack method is essentially a gradient-based adversarial sample generation method;
the optimization function is defined as:
min_{δ_i} ||δ_i||_2 + c · L(F_target(X_i + δ_i), Y_t)
where L is a loss term driving the model output toward the attack target label Y_t, and the constant c balances the distortion against the attack success.
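The claim illustrates attack generation with an optimization-based method; as a simpler, self-contained sketch of the same X'_i = X_i + δ_i idea, the following hypothetical snippet applies an FGSM-style signed-gradient step with toy numbers (this is not the patent's attack, and a real gradient would come from backpropagation through the model):

```python
import numpy as np

def fgsm_step(x, grad, epsilon):
    # X'_i = X_i + delta_i with delta_i = epsilon * sign(grad),
    # where grad is the gradient of the loss w.r.t. the input x.
    return x + epsilon * np.sign(grad)

x = np.array([0.2, -0.5, 0.7])          # toy speech samples
grad = np.array([0.03, -0.4, 0.0])      # toy gradient values
x_adv = fgsm_step(x, grad, epsilon=0.1)
```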
5. The voiceprint recognition adversarial sample defense method based on cosine similarity and speech denoising according to claim 1, wherein step 4 specifically comprises:
a detection threshold is obtained from the clean data by a wavelet-transform reconstruction method;
after the wavelet transform, the correlation within the true signal f_i(t) is removed to the greatest extent and most of its energy is concentrated in a few wavelet coefficients of large magnitude, whereas the disturbance e_i(t) is spread over the whole time axis at all scales with small amplitude; the disturbance can therefore be suppressed and the true signal f_i(t) recovered by threshold filtering, achieving denoising of the voice signal;
the wavelet threshold denoising steps are as follows:
I. apply the wavelet transform to the speech signal f_i(t): select an orthogonal wavelet and the number of decomposition layers N, and perform an N-layer wavelet decomposition of f_i(t);
II. apply linear threshold processing to the wavelet coefficients of the sample signal f_i(t): the high-frequency coefficients of each layer from layer 1 to layer N are processed by a threshold function, while the low-frequency coefficients of each layer are left unprocessed; in the threshold formula, w is a wavelet coefficient and λ is the selected threshold;
III. reconstruct the signal from the processed wavelet coefficients: the voice signal is reconstructed from the low-frequency coefficients of the N-th wavelet decomposition layer and the processed high-frequency coefficients of layers 1 to N, yielding the denoised voice signal f_i^de(t);
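The three steps above can be sketched with a one-level Haar transform and a soft-threshold function; both choices are assumptions, since the claim does not fix the wavelet, the layer count N, or the exact threshold function:

```python
import numpy as np

def haar_dwt(x):
    # One-level Haar decomposition into low-frequency (approximation) and
    # high-frequency (detail) coefficients; len(x) must be even.
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return a, d

def soft_threshold(w, lam):
    # Assumed threshold function: shrink coefficients toward zero,
    # sign(w) * max(|w| - lam, 0).
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def haar_idwt(a, d):
    # Reconstruct the signal from approximation and detail coefficients.
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2.0)
    x[1::2] = (a - d) / np.sqrt(2.0)
    return x

signal = np.array([1.0, 1.2, 3.0, 3.1, -1.0, -0.9, 0.0, 0.05])
a, d = haar_dwt(signal)
denoised = haar_idwt(a, soft_threshold(d, lam=0.05))  # thresholded details
perfect = haar_idwt(a, d)                             # lam = 0: exact inverse
```

Because the Haar transform is orthonormal, reconstruction without thresholding is exact; thresholding only the detail (high-frequency) coefficients matches step II, which leaves the low-frequency coefficients untouched.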
the original clean sample X_i and the wavelet-denoised clean sample X_i^de are input into the voiceprint recognition model to obtain the corresponding output probability vectors F_target(X_i) and F_target(X_i^de), and the cosine similarity value C_i between the two vectors is then calculated;
a decision threshold is then selected from the set C of cosine similarity values of the clean samples: if the false detection rate is set to b%, the decision threshold T is the value at the b-th percentile position of sort(C), where sort(C) denotes sorting C in ascending order.
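The threshold selection can be sketched directly from the description; the function name and the toy similarity values are hypothetical:

```python
def decision_threshold(clean_cosine_values, fpr_percent):
    # Pick T so that roughly fpr_percent% of clean samples fall below it:
    # T is the value at the b-th percentile position of sort(C) (ascending).
    c = sorted(clean_cosine_values)
    idx = int(len(c) * fpr_percent / 100.0)
    return c[idx]

# Toy cosine similarities measured on 10 clean samples.
C = [0.99, 0.97, 0.95, 0.98, 0.96, 0.90, 0.99, 0.94, 0.93, 0.97]
T = decision_threshold(C, fpr_percent=10)
```

With b = 10, exactly one of the ten clean samples lies below T, i.e. a 10% false detection rate on the clean set, matching the construction in the claim.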
6. The voiceprint recognition adversarial sample defense method based on cosine similarity and speech denoising according to claim 1, wherein step 6 specifically comprises:
first, the denoising neural network is trained; the network adopted is DCCRN. DCCRN first applies the short-time Fourier transform to the noisy speech to obtain a complex speech spectrum with real and imaginary parts, and then defines an input complex matrix I = I_r + jI_i and a complex convolution filter W = W_r + jW_i, where the matrices W_r and W_i represent the real and imaginary parts of the complex convolution kernel; the output feature F_out of the complex layer is:
F_out = (I_r * W_r - I_i * W_i) + j(I_r * W_i + I_i * W_r)  (10)
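Equation (10) can be checked numerically: implementing a complex convolution as four real convolutions matches convolving the complex sequences directly. This is a hedged sketch using `numpy.convolve` on toy 1-D sequences as a stand-in for the network's convolution layers:

```python
import numpy as np

# Toy 1-D sequences standing in for a feature map and a convolution kernel.
I_r = np.array([1.0, 2.0, -1.0]); I_i = np.array([0.5, -0.5, 1.0])
W_r = np.array([0.3, -0.2]);      W_i = np.array([0.1, 0.4])

# F_out = (I_r*W_r - I_i*W_i) + j(I_r*W_i + I_i*W_r), with * as convolution.
real = np.convolve(I_r, W_r) - np.convolve(I_i, W_i)
imag = np.convolve(I_r, W_i) + np.convolve(I_i, W_r)
F_out = real + 1j * imag

# Sanity-check target: convolve the complex sequences directly.
direct = np.convolve(I_r + 1j * I_i, W_r + 1j * W_i)
```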
similarly to the complex convolution, given the real part X_r and imaginary part X_i of a complex input, the output L_out of the complex LSTM can be defined as:
L_rr = LSTM_r(X_r); L_ii = LSTM_i(X_i)  (11)
L_ri = LSTM_i(X_r); L_ir = LSTM_r(X_i)  (12)
L_out = (L_rr - L_ii) + j(L_ri + L_ir)  (13)
where LSTM_r and LSTM_i are two conventional LSTM modules processing the real and imaginary parts;
during the training of the network, the goal of DCCRN is to optimize a complex mask matrix M = M_r + jM_i; the denoised speech is obtained after the noisy speech passes through the mask matrix:
S = M · Y  (14)
where S = S_r + jS_i is the complex spectrum of the clean speech and Y = Y_r + jY_i is the complex spectrum of the noisy speech;
given S and Y, M can be calculated by the following formulas:
M_r = (Y_r S_r + Y_i S_i) / (Y_r^2 + Y_i^2)  (15)
M_i = (Y_r S_i - Y_i S_r) / (Y_r^2 + Y_i^2)  (16)
M is expressed in polar coordinates as:
M_magnitude = sqrt(M_r^2 + M_i^2)  (17)
M_phase = arctan2(M_i, M_r)  (18)
with these, the clean speech can be estimated from the noisy speech as:
S̃ = Y_magnitude · M_magnitude · e^{j(Y_phase + M_phase)}  (19)
the loss function of DCCRN is SI-SNR, and the calculation formula is as follows:
wherein <, > represents the dot product between two vectors when estimating speechVery close to the clean speech S, e n ≈0;
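A sketch of the SI-SNR loss as described (dot-product projection of the estimate onto the clean signal); the helper name and the small `eps` stabilizer are assumptions:

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    # Scale-invariant SNR in dB: project the estimate onto the target to get
    # s_target, and treat the residual e_noise as noise.
    s_target = (np.dot(estimate, target) / (np.dot(target, target) + eps)) * target
    e_noise = estimate - s_target
    return 10.0 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))

s = np.array([0.0, 1.0, -1.0, 0.5])
# A nearly perfect estimate gives a high SI-SNR (e_noise close to 0).
good = si_snr(s + 0.001 * np.array([1.0, -1.0, 1.0, -1.0]), s)
# Rescaling the estimate does not reduce the score: the metric is
# scale-invariant, which is why training uses it rather than plain SNR.
scaled = si_snr(2.0 * s, s)
```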
after the DCCRN denoising network has been trained, adversarial samples that escaped detection are input into the denoising network to obtain denoised speech, which is restored to the original correct label.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210653117.4A CN115188384A (en) | 2022-06-09 | 2022-06-09 | Voiceprint recognition countermeasure sample defense method based on cosine similarity and voice denoising |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115188384A true CN115188384A (en) | 2022-10-14 |
Family
ID=83513537
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210653117.4A Pending CN115188384A (en) | 2022-06-09 | 2022-06-09 | Voiceprint recognition countermeasure sample defense method based on cosine similarity and voice denoising |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115188384A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117012204A (en) * | 2023-07-25 | 2023-11-07 | 贵州师范大学 | Defensive method for countermeasure sample of speaker recognition system |
CN117316187A (en) * | 2023-11-30 | 2023-12-29 | 山东同其万疆科技创新有限公司 | English teaching management system |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111564154A (en) * | 2020-03-23 | 2020-08-21 | 北京邮电大学 | Method and device for defending against sample attack based on voice enhancement algorithm |
CN113111945A (en) * | 2021-04-15 | 2021-07-13 | 东南大学 | Confrontation sample defense method based on transform self-encoder |
CN113378643A (en) * | 2021-05-14 | 2021-09-10 | 浙江工业大学 | Signal countermeasure sample detection method based on random transformation and wavelet reconstruction |
WO2021205746A1 (en) * | 2020-04-09 | 2021-10-14 | Mitsubishi Electric Corporation | System and method for detecting adversarial attacks |
CN114511018A (en) * | 2022-01-24 | 2022-05-17 | 中国人民解放军国防科技大学 | Countermeasure sample detection method and device based on intra-class adjustment cosine similarity |
Non-Patent Citations (1)
Title |
---|
XU Dongwei et al.: "A Survey of Speech Adversarial Attack and Defense Methods", Journal of Cyber Security, vol. 7, no. 1, 31 January 2022 (2022-01-31), pages 126-144 *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117012204A (en) * | 2023-07-25 | 2023-11-07 | 贵州师范大学 | Defensive method for countermeasure sample of speaker recognition system |
CN117012204B (en) * | 2023-07-25 | 2024-04-09 | 贵州师范大学 | Defensive method for countermeasure sample of speaker recognition system |
CN117316187A (en) * | 2023-11-30 | 2023-12-29 | 山东同其万疆科技创新有限公司 | English teaching management system |
CN117316187B (en) * | 2023-11-30 | 2024-02-06 | 山东同其万疆科技创新有限公司 | English teaching management system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Novoselov et al. | STC anti-spoofing systems for the ASVspoof 2015 challenge | |
Fallah et al. | A new online signature verification system based on combining Mellin transform, MFCC and neural network | |
Hidayat et al. | Denoising speech for MFCC feature extraction using wavelet transformation in speech recognition system | |
Dennis et al. | Temporal coding of local spectrogram features for robust sound recognition | |
Rajaratnam et al. | Noise flooding for detecting audio adversarial examples against automatic speech recognition | |
CN109872720B (en) | Re-recorded voice detection algorithm for different scene robustness based on convolutional neural network | |
WO2006024117A1 (en) | Method for automatic speaker recognition | |
Rajaratnam et al. | Isolated and ensemble audio preprocessing methods for detecting adversarial examples against automatic speech recognition | |
CN115188384A (en) | Voiceprint recognition countermeasure sample defense method based on cosine similarity and voice denoising | |
Wu et al. | Defense for black-box attacks on anti-spoofing models by self-supervised learning | |
Sun et al. | Ai-synthesized voice detection using neural vocoder artifacts | |
CN111312259B (en) | Voiceprint recognition method, system, mobile terminal and storage medium | |
CN116416997A (en) | Intelligent voice fake attack detection method based on attention mechanism | |
Kim et al. | Multifeature fusion-based earthquake event classification using transfer learning | |
Ahmad et al. | Automatic detection of tree cutting in forests using acoustic properties | |
Wu et al. | Adversarial sample detection for speaker verification by neural vocoders | |
Białobrzeski et al. | Robust Bayesian and light neural networks for voice spoofing detection | |
CN114640518B (en) | Personalized trigger back door attack method based on audio steganography | |
Maciejewski et al. | Neural networks for vehicle recognition | |
Alegre et al. | Evasion and obfuscation in automatic speaker verification | |
Raj et al. | Reconstruction of damaged spectrographic features for robust speech recognition. | |
Sailor et al. | Unsupervised Representation Learning Using Convolutional Restricted Boltzmann Machine for Spoof Speech Detection. | |
CN116229992A (en) | Voice lie detection method, device, medium and equipment | |
Gala et al. | Evaluating the effectiveness of attacks and defenses on machine learning through adversarial samples | |
Yu et al. | A multi-spike approach for robust sound recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||