
CN115188384A - Voiceprint recognition countermeasure sample defense method based on cosine similarity and voice denoising - Google Patents

Voiceprint recognition countermeasure sample defense method based on cosine similarity and voice denoising Download PDF

Info

Publication number
CN115188384A
Authority
CN
China
Prior art keywords
sample
voice
cosine similarity
denoising
threshold
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210653117.4A
Other languages
Chinese (zh)
Inventor
Xu Dongwei (徐东伟)
Jiang Bin (蒋斌)
Fang Ruochen (房若尘)
Xuan Qi (宣琦)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202210653117.4A priority Critical patent/CN115188384A/en
Publication of CN115188384A publication Critical patent/CN115188384A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/20 - Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions
    • G10L17/04 - Training, enrolment or model building
    • G10L17/06 - Decision making techniques; Pattern matching strategies
    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • G10L17/22 - Interactive procedures; Man-machine interfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Complex Calculations (AREA)

Abstract

A voiceprint recognition adversarial sample defense method based on cosine similarity and voice denoising: first, data preprocessing is carried out on the speaker voice data; a voiceprint recognition model is built; an adversarial sample generator carrying malicious information is designed by combining the voiceprint recognition model with several different adversarial attack methods; wavelet-transform reconstruction is performed on clean data, the cosine similarity between the output probability vectors produced by the classification model for samples before and after the wavelet transform is calculated, and a cosine similarity threshold is set; wavelet-transform reconstruction is then performed on the adversarial samples, and the cosine similarity between the output vectors before and after reconstruction is calculated and compared with the threshold: a sample whose similarity is below the threshold is judged adversarial, while a sample above the threshold goes undetected; finally, a voice denoising neural network is trained, and the undetected adversarial samples are input into the denoising network to remove the adversarial perturbation.

Description

Voiceprint recognition countermeasure sample defense method based on cosine similarity and voice denoising
Technical Field
The invention relates to a method for defending voiceprint recognition against adversarial samples, and belongs to the field of deep learning security.
Background
With its rapid development, deep learning has become one of the most widely used technologies in artificial intelligence, influencing and changing people's lives in many respects; typical applications include smart homes, intelligent driving, speech recognition and voiceprint recognition. As a very complex software system, a deep learning system can also face various hacking attacks, which may threaten property security, personal privacy, traffic safety and public safety. Attacks against deep learning systems typically include the following. 1. Model stealing: a hacker steals the model file deployed on the server by various advanced means. 2. Data poisoning: abnormal data are injected into the deep learning training samples so that the model misclassifies when certain conditions are met; for example, a backdoor attack algorithm adds a backdoor mark to the poisoned data so that the model is poisoned. 3. Adversarial samples: an adversarial sample is an input formed by deliberately adding a subtle perturbation to the data, causing the model to give an erroneous output with high confidence. In brief, an adversarial sample makes the deep learning model misclassify by superimposing a well-constructed, human-imperceptible perturbation on the original data. The security of deep learning has become a problem that urgently needs to be solved today.
Defense methods are mainly divided into two categories: adversarial sample defense and adversarial sample detection. The main purpose of adversarial sample defense is to restore the classification label of the adversarial sample to the label of the normal sample; the main purpose of adversarial sample detection is to find the adversarial samples in a sample set and reject them.
Disclosure of Invention
The invention provides a voiceprint recognition adversarial sample defense method based on cosine similarity and voice denoising, which aims to overcome the defects in the prior art.
The technical method adopted by the invention to solve the technical problem is as follows: data preprocessing is carried out on the speaker voice data used; a voiceprint recognition model is built; an adversarial sample generator carrying malicious information is designed by combining the voiceprint recognition model with several different adversarial attack methods; wavelet-transform reconstruction is performed on clean data, the cosine similarity between the output probability vectors produced by the classification model for samples before and after the wavelet transform is calculated, and a cosine similarity threshold is set; wavelet-transform reconstruction is then performed on the adversarial samples, and the cosine similarity between the output vectors before and after reconstruction is calculated and compared with the threshold: a sample whose similarity is below the threshold is judged adversarial, while a sample above the threshold goes undetected; finally, a voice denoising neural network is trained, and the undetected adversarial samples are input into the denoising network to remove the adversarial perturbation.
A voiceprint recognition adversarial sample defense method based on cosine similarity and voice denoising comprises the following steps:
Step 1: carry out data preprocessing on the speaker voice signals;
Step 2: build a voiceprint recognition model;
Step 3: design adversarial samples according to the voiceprint model;
Step 4: obtain a detection threshold from clean data by a wavelet-transform reconstruction method;
Step 5: detect adversarial samples according to the decision threshold: apply the wavelet transform to a sample X_i^adv, obtain the cosine similarity value C'_i between the model outputs before and after reconstruction, and compare it with the decision threshold T; if C'_i < T, X_i^adv is judged to be an adversarial sample;
Step 6: defend the adversarial samples with the denoising neural network.
Further, step 1 specifically includes:
First, data extraction is performed on the existing voice files (WAV format); the speaker voice data are extracted with the librosa audio-processing Python toolkit:
X_i, sr = librosa.load(T_i, sr=None), i = 1, 2, ..., n+m   (1)
where X_i is the voice data extracted from the i-th speaker voice file, sr is the sampling rate of the voice data, and T_i is the i-th speaker voice file.
The speech signal data set is normalized, and the data set D is divided into a training set D_train and a test set D_test, where
D_train = {(X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)}
D_test = {(X_{n+1}, Y_{n+1}), (X_{n+2}, Y_{n+2}), ..., (X_{n+m}, Y_{n+m})}
X_i = (X_{i1}, X_{i2}, ..., X_{id}), where d denotes the data length of X_i and Y_i ∈ {1, 2, ..., c} denotes the class label among c classes. The normalization formula is:
X̄_i = X_i / max(X_i)   (2)
where X̄_i denotes the normalized sample and max(X_i) denotes the maximum value of the d sampling points in the sample, so that the normalized X̄_i lies in [-1, 1].
Further, step 2 specifically includes:
predesignating the structure and parameters of the classification model, and not changing; the classification model structure adopted by the invention mainly comprises a 1D convolution layer, a maximum pooling layer, a batch normalization layer and a full-connection layer. The specific structure is shown in table 1, training is performed by using a training data set, and the voiceprint recognition classification model is as follows:
target model:
Figure BDA0003686632470000041
F target (. O) a probability vector representing the model output;
Further, step 3 specifically includes:
The adversarial sample is defined as:
X_i^adv = X_i + δ_i   (4)
where δ_i is the perturbation added to the original sample.
The generation of adversarial samples is described taking an optimization-based adversarial attack method as an example; optimization-based attack methods are in essence gradient-based adversarial sample generation methods.
The optimization function is defined as:
min_{δ_i} ||δ_i||   (5)
s.t. F_target(X_i + δ_i) ≠ Y_i   (6)
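Among the white-box attacks named later in the description (FGSM, BIM, PGD, DeepFool, CW), FGSM is the simplest gradient-based instance; a hedged PyTorch sketch follows, in which model, x, y and the step size eps are illustrative assumptions rather than the patent's parameters.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=1e-3):
    """One-step gradient-sign attack: X_adv = X + delta_i, with
    delta_i = eps * sign(grad_x L(F_target(x), y)) (cf. eq. 4)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)   # model(x): logits of the voiceprint classifier
    loss.backward()
    delta = eps * x.grad.sign()           # the perturbation delta_i
    return (x + delta).detach()
```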
Further, step 4 specifically includes:
A detection threshold is obtained from the clean data by a wavelet-transform reconstruction method.
The perturbation e_i(t) is suppressed to recover the true signal f_i(t): after the wavelet transform, the correlation within the signal f_i(t) is removed to the greatest possible extent and most of its energy is concentrated on a few wavelet coefficients of large amplitude, whereas the perturbation e_i(t) is spread over the whole time axis at every scale after the wavelet transform and its amplitudes are not large. The voice signal can therefore be denoised by threshold filtering.
The wavelet threshold denoising steps are as follows:
I. Perform the wavelet transform on the speech signal f_i(t): select an orthogonal wavelet and the number of decomposition levels N, and perform an N-level wavelet decomposition of f_i(t).
II. Apply threshold processing to the wavelet transform coefficients of the sample signal f_i(t): the high-frequency coefficients of each level, from the first level to the N-th level, are processed through a threshold function, while the low-frequency coefficients of each level are left unprocessed. The threshold formula (soft thresholding) is:
ŵ = sgn(w)(|w| - λ) if |w| ≥ λ, and ŵ = 0 if |w| < λ   (7)
where w is a wavelet coefficient and λ is the selected threshold.
III. Reconstruct the processed wavelet coefficients: the voice signal is reconstructed from the low-frequency coefficients of the N-th decomposition level and the processed high-frequency coefficients of levels 1 to N, giving the denoised voice signal f_i^de(t), with
f_i^de(t) ≈ f_i(t)   (8)
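Steps I-III can be sketched with the PyWavelets package; the 'db4' wavelet and N = 4 levels are assumptions chosen for illustration (the method only requires an orthogonal wavelet and N decomposition levels), while λ = 0.02 matches the value used in Example 2.

```python
import numpy as np
import pywt

def wavelet_denoise(f, wavelet="db4", N=4, lam=0.02):
    """N-level wavelet decomposition, soft-threshold the high-frequency
    (detail) coefficients (eq. 7), keep the low-frequency (approximation)
    coefficients, and reconstruct the denoised signal f_de (step III)."""
    coeffs = pywt.wavedec(f, wavelet, level=N)   # [cA_N, cD_N, ..., cD_1]
    kept = [coeffs[0]]                           # low-frequency coefficients untouched
    kept += [pywt.threshold(d, lam, mode="soft") for d in coeffs[1:]]
    f_de = pywt.waverec(kept, wavelet)
    return f_de[: len(f)]                        # waverec may append one padding sample
```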
The original clean sample X_i and the clean sample after the wavelet transform X_i^de are input into the voiceprint recognition model to obtain the corresponding output probability vectors F_target(X_i) and F_target(X_i^de); the cosine similarity value between the two vectors is then calculated as
C_i = (F_target(X_i) · F_target(X_i^de)) / (||F_target(X_i)|| · ||F_target(X_i^de)||)   (9)
A decision threshold is then selected from the set C of cosine similarity values of the clean samples: if the false-detection rate is set to b%, the decision threshold T is the value at the b-th percentile of sort(C), where sort(C) denotes C sorted in ascending order.
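The threshold selection can be written down directly; in this sketch, model is assumed to return the output probability vector F_target(·), wavelet_denoise is the helper sketched above, and b is the tolerated false-detection rate in percent.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two probability vectors (eq. 9)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def decision_threshold(model, clean_samples, b=5.0):
    """T is the value at the b-th percentile of the ascending-sorted cosine
    similarities of clean samples, so about b% of clean samples are
    (falsely) flagged as adversarial."""
    C = [cosine(model(x), model(wavelet_denoise(x))) for x in clean_samples]
    return float(np.percentile(C, b))
```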
Further, step 6 specifically includes:
First, the denoising neural network is trained. The denoising neural network adopted is DCCRN. DCCRN first transforms the noisy speech by the short-time Fourier transform to obtain a complex speech spectrum with a real part and an imaginary part. Define the input complex matrix I = I_r + jI_i and the complex convolution filter W = W_r + jW_i, where the matrices W_r and W_i represent the real and imaginary parts of the complex convolution kernel; the output feature F_out of the complex layer is:
F_out = (I_r * W_r - I_i * W_i) + j(I_r * W_i + I_i * W_r)   (10)
Similarly to the complex convolution, given the real and imaginary parts X_r and X_i of a complex input, the output L_out of the complex LSTM can be defined as:
L_rr = LSTM_r(X_r); L_ii = LSTM_i(X_i)   (11)
L_ri = LSTM_i(X_r); L_ir = LSTM_r(X_i)   (12)
L_out = (L_rr - L_ii) + j(L_ri + L_ir)   (13)
where LSTM_r and LSTM_i are two conventional LSTM modules for the real and imaginary parts.
During network training, the purpose of DCCRN is to optimize a complex mask matrix M = M_r + jM_i; applying the mask matrix to the noisy speech gives the denoised speech:
S = M * Y   (14)
where S = S_r + jS_i is the complex spectrum of clean speech and Y = Y_r + jY_i is the complex spectrum of noisy speech.
Given S and Y, M can be calculated by the following formula:
M = S / Y = (S_r + jS_i) / (Y_r + jY_i) = ((S_r·Y_r + S_i·Y_i) + j(S_i·Y_r - S_r·Y_i)) / (Y_r² + Y_i²)   (15)
M is expressed in polar coordinates as:
M = M_mag · e^(j·M_phase)   (16)
M_mag = √(M_r² + M_i²)   (17)
M_phase = arctan2(M_i, M_r)   (18)
With this, the enhanced speech can be estimated from the noisy speech as:
S̃ = M_mag · Y_mag · e^(j·(M_phase + Y_phase))   (19)
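Equations (15)-(19) can be checked numerically with a few lines of NumPy; S and Y here stand for single complex STFT bins of clean and noisy speech and are illustrative values only.

```python
import numpy as np

def dccrn_e_estimate(S, Y):
    """Ideal complex ratio mask M = S / Y (eq. 15), its polar form
    (eqs. 16-18), and the magnitude/phase estimate of eq. (19)."""
    M = S / Y                                  # complex division realizes eq. (15)
    M_mag, M_phase = np.abs(M), np.angle(M)    # eqs. (17) and (18)
    Y_mag, Y_phase = np.abs(Y), np.angle(Y)
    return M_mag * Y_mag * np.exp(1j * (M_phase + Y_phase))   # eq. (19)

# with the ideal mask, the estimate recovers the clean bin exactly:
# dccrn_e_estimate(0.3 + 0.4j, 1.0 - 0.5j) ~= 0.3 + 0.4j
```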
The loss function of DCCRN is SI-SNR, calculated as follows:
s_target = (⟨s̃, s⟩ · s) / ||s||²   (20)
e_noise = s̃ - s_target   (21)
SI-SNR = 10 · log₁₀(||s_target||² / ||e_noise||²)   (22)
where ⟨·,·⟩ denotes the dot product between two vectors; when the estimated speech s̃ is very close to the clean speech s, e_noise ≈ 0.
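A minimal NumPy sketch of the SI-SNR computation of eqs. (20)-(22); during training, the negative SI-SNR would serve as the loss to be minimized.

```python
import numpy as np

def si_snr(s_est, s):
    """Scale-invariant signal-to-noise ratio in dB (eqs. 20-22)."""
    s_target = np.dot(s_est, s) * s / np.dot(s, s)                    # eq. (20)
    e_noise = s_est - s_target                                        # eq. (21)
    return 10.0 * np.log10(np.sum(s_target**2) / np.sum(e_noise**2))  # eq. (22)
```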
After the DCCRN denoising network is trained, the undetected adversarial samples are input into the denoising network to obtain denoised voice, and the denoised voice is restored to its original correct label.
The working principle of the invention is as follows:
Data preprocessing of the speech data set: the original time-domain waveform data of each segment of voice are acquired, divided into a training set and a test set, and normalized.
Building the voiceprint recognition model: the structure and parameters of the voiceprint recognition model are specified in advance and are not changed thereafter. A data set suited to the recognition model is also given, namely speaker voice samples comprising the input time-domain waveform data used for speaker recognition and the corresponding classification labels; the model predicts the sample set in this data set with high accuracy.
Designing adversarial samples according to the voiceprint recognition model: several commonly used white-box adversarial attack methods are chosen. The input data are adjusted along the gradient direction given by the parameters of the voiceprint recognition model, so that the voiceprint recognition model outputs a wrong label although the input sample changes only slightly.
Obtaining a detection threshold from clean data by the wavelet-transform reconstruction method: a clean sample is wavelet-decomposed and then wavelet-reconstructed, denoising the sample through the wavelet transform. The clean samples before and after the wavelet transform are input into the voiceprint recognition network to obtain two model output probability vectors, the cosine similarity between the corresponding vectors is calculated, and a threshold is set among these cosine similarity values, selected with the false-detection rate as the criterion.
Detecting adversarial samples according to the decision threshold: the wavelet transform is applied to a sample, the samples before and after the transform are input into the voiceprint recognition model to obtain the output probability vectors and their cosine similarity value, which is compared with the decision threshold; if it is smaller than the decision threshold, the sample is judged adversarial.
Defending adversarial samples with the denoising neural network: a denoising neural network is first trained; the adversarial samples not detected in the previous step are then input into the denoising network, finally yielding denoised samples, so that a large number of adversarial samples lose their adversarial effect, achieving further defense.
Adversarial samples differ from normal samples in that they carry more noise, so after denoising they change more than clean samples do. Reflected in the model, the output probability of an adversarial sample changes more before and after denoising. This change can be measured by the cosine similarity between the output vectors: a small change in the output vector gives a large cosine similarity value, and a large change gives a small one. Adversarial samples can be detected using this property. To achieve further defense, the adversarial samples that survive the detection step are additionally denoised, improving the defense effect. The invention enhances the security of the voiceprint recognition model.
The invention has the advantages that: the method can accurately detect the adversarial samples in the data and can further purify and defend the undetected adversarial samples, effectively reducing the risk brought by adversarial samples and enhancing the security of the voiceprint recognition model.
Drawings
FIG. 1 is a basic flow diagram of the process of the present invention.
FIG. 2 is a flow chart of the wavelet transform denoising of the present invention.
Fig. 3 is a diagram illustrating a structure of a DCCRN network according to the present invention.
Detailed description of embodiments:
the technical scheme of the invention is further explained by combining the attached drawings.
Example 1
A voiceprint recognition adversarial sample defense method based on cosine similarity and voice denoising comprises the following steps:
(1) Data preprocessing of the speaker voice signal:
First, data extraction is performed on the existing voice files (WAV format); the speaker voice data are extracted with the librosa audio-processing Python toolkit:
X_i, sr = librosa.load(T_i, sr=None), i = 1, 2, ..., n+m   (1)
where X_i is the voice data extracted from the i-th speaker voice file, sr is the sampling rate of the voice data, and T_i is the i-th speaker voice file.
The speech signal data set is normalized, and the data set D is divided into a training set D_train and a test set D_test, where
D_train = {(X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)}
D_test = {(X_{n+1}, Y_{n+1}), (X_{n+2}, Y_{n+2}), ..., (X_{n+m}, Y_{n+m})}
X_i = (X_{i1}, X_{i2}, ..., X_{id}), where d denotes the data length of X_i and Y_i ∈ {1, 2, ..., c} denotes the class label among c classes. The normalization formula is:
X̄_i = X_i / max(X_i)   (2)
where X̄_i denotes the normalized sample and max(X_i) denotes the maximum value of the d sampling points in the sample, so that the normalized X̄_i lies in [-1, 1].
(2) Building the voiceprint recognition model: the structure and parameters of the classification model are specified in advance and are not changed thereafter. The classification model structure adopted by the invention mainly comprises 1D convolution layers, max-pooling layers, batch-normalization layers and fully connected layers; the specific structure is shown in Table 1. Training is performed with the training data set, and the voiceprint recognition classification model is:
target model: Ŷ_i = F_target(X_i)   (3)
where F_target(·) denotes the probability vector output by the model.
(3) Designing adversarial samples according to the voiceprint recognition classifier:
The adversarial sample is defined as:
X_i^adv = X_i + δ_i   (4)
where δ_i is the perturbation added to the original sample.
The generation of adversarial samples is described here taking an optimization-based adversarial attack method as an example. Optimization-based attack methods are in essence gradient-based adversarial sample generation methods.
The optimization function is defined as:
min_{δ_i} ||δ_i||   (5)
s.t. F_target(X_i + δ_i) ≠ Y_i   (6)
(4) Calculating the cosine similarity of the clean samples according to the wavelet transform and selecting the decision threshold:
The perturbation e_i(t) is suppressed to recover the true signal f_i(t): after the wavelet transform, the correlation within the signal f_i(t) is removed to the greatest possible extent and most of its energy is concentrated on a few wavelet coefficients of large amplitude, whereas the perturbation e_i(t) is spread over the whole time axis at every scale after the wavelet transform and its amplitudes are not large. The voice signal can therefore be denoised by threshold filtering.
The wavelet threshold denoising steps are as follows:
I. Perform the wavelet transform on the speech signal f_i(t): select an orthogonal wavelet and the number of decomposition levels N, and perform an N-level wavelet decomposition of f_i(t).
II. Apply threshold processing to the wavelet transform coefficients of the sample signal f_i(t): the high-frequency coefficients of each level, from the first level to the N-th level, are processed through a threshold function, while the low-frequency coefficients of each level are left unprocessed. The threshold formula (soft thresholding) is:
ŵ = sgn(w)(|w| - λ) if |w| ≥ λ, and ŵ = 0 if |w| < λ   (7)
where w is a wavelet coefficient and λ is the selected threshold.
III. Reconstruct the processed wavelet coefficients: the voice signal is reconstructed from the low-frequency coefficients of the N-th decomposition level and the processed high-frequency coefficients of levels 1 to N, giving the denoised voice signal f_i^de(t), with
f_i^de(t) ≈ f_i(t)   (8)
The original clean sample X_i and the clean sample after the wavelet transform X_i^de are input into the voiceprint recognition model to obtain the corresponding output probability vectors F_target(X_i) and F_target(X_i^de); the cosine similarity value between the two vectors is then calculated as
C_i = (F_target(X_i) · F_target(X_i^de)) / (||F_target(X_i)|| · ||F_target(X_i^de)||)   (9)
A decision threshold is then selected from the set C of cosine similarity values of the clean samples: if the false-detection rate is set to b%, the decision threshold T is the value at the b-th percentile of sort(C), where sort(C) denotes C sorted in ascending order.
(5) Detecting adversarial samples according to the decision threshold: the wavelet transform is applied to a sample X_i^adv and the cosine similarity value C'_i between the model outputs before and after reconstruction is obtained; this value is compared with the decision threshold T, and if C'_i < T, X_i^adv is judged to be an adversarial sample.
(6) Denoising the undetected adversarial samples with the denoising neural network:
First, the denoising neural network is trained. The denoising neural network adopted by the invention is DCCRN. DCCRN first transforms the noisy speech by the short-time Fourier transform to obtain a complex speech spectrum with a real part and an imaginary part. Define the input complex matrix I = I_r + jI_i and the complex convolution filter W = W_r + jW_i, where the matrices W_r and W_i represent the real and imaginary parts of the complex convolution kernel; the output feature F_out of the complex layer is:
F_out = (I_r * W_r - I_i * W_i) + j(I_r * W_i + I_i * W_r)   (10)
Similarly to the complex convolution, given the real and imaginary parts X_r and X_i of a complex input, the output L_out of the complex LSTM can be defined as:
L_rr = LSTM_r(X_r); L_ii = LSTM_i(X_i)   (11)
L_ri = LSTM_i(X_r); L_ir = LSTM_r(X_i)   (12)
L_out = (L_rr - L_ii) + j(L_ri + L_ir)   (13)
where LSTM_r and LSTM_i are two conventional LSTM modules for the real and imaginary parts.
During network training, the purpose of DCCRN is to optimize a complex mask matrix M = M_r + jM_i; applying the mask matrix to the noisy speech gives the denoised speech:
S = M * Y   (14)
where S = S_r + jS_i is the complex spectrum of clean speech and Y = Y_r + jY_i is the complex spectrum of noisy speech.
Given S and Y, M can be calculated by the following formula:
Figure BDA0003686632470000131
m is expressed in polar coordinates as:
Figure BDA0003686632470000132
Figure BDA0003686632470000133
M phase =arctan2(M i ,M r ) (18)
with this noisy speech can be estimated as:
Figure BDA0003686632470000134
The loss function of DCCRN is SI-SNR, calculated as follows:
s_target = (⟨s̃, s⟩ · s) / ||s||²   (20)
e_noise = s̃ - s_target   (21)
SI-SNR = 10 · log₁₀(||s_target||² / ||e_noise||²)   (22)
where ⟨·,·⟩ denotes the dot product between two vectors; when the estimated speech s̃ is very close to the clean speech s, e_noise ≈ 0.
After the DCCRN denoising network is trained, the undetected adversarial samples are input into the denoising network to obtain denoised voice, and the denoised voice is restored to its original correct label.
Example 2: data in actual experiments
(1) Selection of experimental data.
The data set used in the experiment is the AISHELL-1 voice data set, which collects voices recorded in a quiet environment by speakers of different age groups, genders and regions, at a sampling rate of 16000 Hz. We selected the voices of 20 speakers as the data set of the voiceprint recognition model, and the time-domain data length of the original waveform extracted for each sentence of voice was 60000. The data are preprocessed and stored as arrays of shape (batchsize, 60000, 1), the corresponding label data are generated, and the processed data sets are saved as npy files. The inputs of the denoising network DCCRN are the adversarial samples generated by FGSM, BIM, PGD, DeepFool and CW, and the corresponding clean samples serve as the training targets.
(2) Determination of parameters.
The threshold used in the wavelet-transform thresholding is λ = 0.02, and the decision threshold used to detect adversarial samples is T = 0.91955.
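With these parameters, the detection decision of step (5) reduces to a single comparison; the sketch below merely ties together the cosine and wavelet_denoise helpers assumed earlier and the experimentally chosen T, and is illustrative only.

```python
LAM = 0.02     # wavelet threshold lambda from the experiment
T = 0.91955    # decision threshold for detecting adversarial samples

def is_adversarial(model, x):
    """Flag x when the cosine similarity C'_i between the model outputs
    before and after wavelet denoising falls below T (step 5)."""
    c = cosine(model(x), model(wavelet_denoise(x, lam=LAM)))
    return c < T
```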
(3) Experimental results.
The invention uses 5 attack algorithms (FGSM, BIM, PGD, DeepFool, CW) to generate 5 kinds of adversarial samples and defends against them with the proposed defense method, using the defense success rate ACC and the false-detection rate FPR as measures of the defense effect, compared against other defense (detection) methods; the experimental results are shown in Table 1.
TABLE 1 defense effects
[Table 1 (defense effects): defense success rate ACC and false-detection rate FPR for the FGSM, BIM, PGD, DeepFool and CW adversarial samples, compared with other defense (detection) methods; the table is reproduced only as an image in the original publication.]
The embodiments described in this specification are merely illustrative of the inventive concept; the scope of the present invention should not be considered limited to the specific forms set forth in the embodiments, but also covers equivalent technical means that those skilled in the art can conceive of on the basis of the inventive concept.

Claims (6)

1. A voiceprint recognition adversarial sample defense method based on cosine similarity and voice denoising, characterized by comprising the following steps:
Step 1: carry out data preprocessing on the speaker voice signals;
Step 2: build a voiceprint recognition model;
Step 3: design adversarial samples according to the voiceprint model;
Step 4: obtain a detection threshold from clean data by a wavelet-transform reconstruction method;
Step 5: detect adversarial samples according to the decision threshold: apply the wavelet transform to a sample X_i^adv, obtain the cosine similarity value C'_i between the model outputs before and after reconstruction, and compare it with the decision threshold T; if C'_i < T, X_i^adv is judged to be an adversarial sample;
Step 6: defend the adversarial samples with the denoising neural network.
2. The voiceprint recognition adversarial sample defense method based on cosine similarity and voice denoising as claimed in claim 1, wherein step 1 specifically comprises:
First, data extraction is performed on the existing voice files (WAV format); the speaker voice data are extracted with the librosa audio-processing Python toolkit:
X_i, sr = librosa.load(T_i, sr=None), i = 1, 2, ..., n+m   (1)
where X_i is the voice data extracted from the i-th speaker voice file, sr is the sampling rate of the voice data, and T_i is the i-th speaker voice file;
the speech signal data set is normalized, and the data set D is divided into a training set D_train and a test set D_test, where
D_train = {(X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)}
D_test = {(X_{n+1}, Y_{n+1}), (X_{n+2}, Y_{n+2}), ..., (X_{n+m}, Y_{n+m})}
X_i = (X_{i1}, X_{i2}, ..., X_{id}), where d denotes the data length of X_i and Y_i ∈ {1, 2, ..., c} denotes the class label among c classes; the normalization formula is:
X̄_i = X_i / max(X_i)   (2)
where X̄_i denotes the normalized sample and max(X_i) denotes the maximum value of the d sampling points in the sample, so that the normalized X̄_i lies in [-1, 1].
3. The voiceprint recognition adversarial sample defense method based on cosine similarity and voice denoising as claimed in claim 1, wherein step 2 specifically comprises:
The structure and parameters of the classification model are specified in advance and are not changed thereafter; the classification model structure adopted by the invention mainly comprises 1D convolution layers, max-pooling layers, batch-normalization layers and fully connected layers; the specific structure is shown in Table 1. Training is performed with the training data set, and the voiceprint recognition classification model is:
target model: Ŷ_i = F_target(X_i)   (3)
where F_target(·) denotes the probability vector output by the model.
4. The voiceprint recognition adversarial sample defense method based on cosine similarity and voice denoising as claimed in claim 1, wherein step 3 specifically comprises:
The adversarial sample is defined as:
X_i^adv = X_i + δ_i   (4)
where δ_i is the perturbation added to the original sample;
the generation of adversarial samples is described taking an optimization-based adversarial attack method as an example; optimization-based attack methods are in essence gradient-based adversarial sample generation methods;
the optimization function is defined as:
min_{δ_i} ||δ_i||   (5)
s.t. F_target(X_i + δ_i) ≠ Y_i   (6)
5. The voiceprint recognition adversarial sample defense method based on cosine similarity and voice denoising as claimed in claim 1, wherein step 4 specifically comprises:
A detection threshold is obtained from the clean data by a wavelet-transform reconstruction method;
the perturbation e_i(t) is suppressed to recover the true signal f_i(t): after the wavelet transform, the correlation within the signal f_i(t) is removed to the greatest possible extent and most of its energy is concentrated on a few wavelet coefficients of large amplitude, whereas the perturbation e_i(t) is spread over the whole time axis at every scale after the wavelet transform and its amplitudes are not large; the voice signal can therefore be denoised by threshold filtering;
the wavelet threshold denoising steps are as follows:
I. Perform the wavelet transform on the speech signal f_i(t): select an orthogonal wavelet and the number of decomposition levels N, and perform an N-level wavelet decomposition of f_i(t);
II. Apply threshold processing to the wavelet transform coefficients of the sample signal f_i(t): the high-frequency coefficients of each level, from the first level to the N-th level, are processed through a threshold function, while the low-frequency coefficients of each level are left unprocessed; the threshold formula (soft thresholding) is:
ŵ = sgn(w)(|w| - λ) if |w| ≥ λ, and ŵ = 0 if |w| < λ   (7)
where w is a wavelet coefficient and λ is the selected threshold;
III. Reconstruct the processed wavelet coefficients: the voice signal is reconstructed from the low-frequency coefficients of the N-th decomposition level and the processed high-frequency coefficients of levels 1 to N, giving the denoised voice signal f_i^de(t), with
f_i^de(t) ≈ f_i(t)   (8)
The original clean sample X_i and the clean sample after the wavelet transform X_i^de are input into the voiceprint recognition model to obtain the corresponding output probability vectors F_target(X_i) and F_target(X_i^de); the cosine similarity value between the two vectors is then calculated as
C_i = (F_target(X_i) · F_target(X_i^de)) / (||F_target(X_i)|| · ||F_target(X_i^de)||)   (9)
A decision threshold is then selected from the set C of cosine similarity values of the clean samples: if the false-detection rate is set to b%, the decision threshold T is the value at the b-th percentile of sort(C), where sort(C) denotes C sorted in ascending order.
6. The voiceprint recognition adversarial sample defense method based on cosine similarity and voice denoising as claimed in claim 1, wherein step 6 specifically comprises:
First, the denoising neural network is trained. The denoising neural network adopted is DCCRN. DCCRN first transforms the noisy speech by the short-time Fourier transform to obtain a complex speech spectrum with a real part and an imaginary part. Define the input complex matrix I = I_r + jI_i and the complex convolution filter W = W_r + jW_i, where the matrices W_r and W_i represent the real and imaginary parts of the complex convolution kernel; the output feature F_out of the complex layer is:
F_out = (I_r * W_r - I_i * W_i) + j(I_r * W_i + I_i * W_r)   (10)
Similarly to the complex convolution, given the real and imaginary parts X_r and X_i of a complex input, the output L_out of the complex LSTM can be defined as:
L_rr = LSTM_r(X_r); L_ii = LSTM_i(X_i)   (11)
L_ri = LSTM_i(X_r); L_ir = LSTM_r(X_i)   (12)
L_out = (L_rr - L_ii) + j(L_ri + L_ir)   (13)
where LSTM_r and LSTM_i are two conventional LSTM modules for the real and imaginary parts;
during network training, the purpose of DCCRN is to optimize a complex mask matrix M = M_r + jM_i; applying the mask matrix to the noisy speech gives the denoised speech:
S = M * Y   (14)
where S = S_r + jS_i is the complex spectrum of clean speech and Y = Y_r + jY_i is the complex spectrum of noisy speech;
given S and Y, M can be calculated by the following formula:
M = S / Y = (S_r + jS_i) / (Y_r + jY_i) = ((S_r·Y_r + S_i·Y_i) + j(S_i·Y_r - S_r·Y_i)) / (Y_r² + Y_i²)   (15)
M is expressed in polar coordinates as:
M = M_mag · e^(j·M_phase)   (16)
M_mag = √(M_r² + M_i²)   (17)
M_phase = arctan2(M_i, M_r)   (18)
with this, the enhanced speech can be estimated from the noisy speech as:
S̃ = M_mag · Y_mag · e^(j·(M_phase + Y_phase))   (19)
The loss function of DCCRN is SI-SNR, calculated as follows:
s_target = (⟨s̃, s⟩ · s) / ||s||²   (20)
e_noise = s̃ - s_target   (21)
SI-SNR = 10 · log₁₀(||s_target||² / ||e_noise||²)   (22)
where ⟨·,·⟩ denotes the dot product between two vectors; when the estimated speech s̃ is very close to the clean speech s, e_noise ≈ 0;
after the DCCRN denoising network is trained, the undetected adversarial samples are input into the denoising network to obtain denoised voice, and the denoised voice is restored to its original correct label.
CN202210653117.4A 2022-06-09 2022-06-09 Voiceprint recognition countermeasure sample defense method based on cosine similarity and voice denoising Pending CN115188384A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210653117.4A CN115188384A (en) 2022-06-09 2022-06-09 Voiceprint recognition countermeasure sample defense method based on cosine similarity and voice denoising

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210653117.4A CN115188384A (en) 2022-06-09 2022-06-09 Voiceprint recognition countermeasure sample defense method based on cosine similarity and voice denoising

Publications (1)

Publication Number Publication Date
CN115188384A true CN115188384A (en) 2022-10-14

Family

ID=83513537

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210653117.4A Pending CN115188384A (en) 2022-06-09 2022-06-09 Voiceprint recognition countermeasure sample defense method based on cosine similarity and voice denoising

Country Status (1)

Country Link
CN (1) CN115188384A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117012204A (en) * 2023-07-25 2023-11-07 贵州师范大学 Defensive method for countermeasure sample of speaker recognition system
CN117316187A (en) * 2023-11-30 2023-12-29 山东同其万疆科技创新有限公司 English teaching management system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111564154A (en) * 2020-03-23 2020-08-21 北京邮电大学 Method and device for defending against sample attack based on voice enhancement algorithm
CN113111945A (en) * 2021-04-15 2021-07-13 东南大学 Confrontation sample defense method based on transform self-encoder
CN113378643A (en) * 2021-05-14 2021-09-10 浙江工业大学 Signal countermeasure sample detection method based on random transformation and wavelet reconstruction
WO2021205746A1 (en) * 2020-04-09 2021-10-14 Mitsubishi Electric Corporation System and method for detecting adversarial attacks
CN114511018A (en) * 2022-01-24 2022-05-17 中国人民解放军国防科技大学 Countermeasure sample detection method and device based on intra-class adjustment cosine similarity

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111564154A (en) * 2020-03-23 2020-08-21 北京邮电大学 Method and device for defending against sample attack based on voice enhancement algorithm
WO2021205746A1 (en) * 2020-04-09 2021-10-14 Mitsubishi Electric Corporation System and method for detecting adversarial attacks
CN113111945A (en) * 2021-04-15 2021-07-13 东南大学 Confrontation sample defense method based on transform self-encoder
CN113378643A (en) * 2021-05-14 2021-09-10 浙江工业大学 Signal countermeasure sample detection method based on random transformation and wavelet reconstruction
CN114511018A (en) * 2022-01-24 2022-05-17 中国人民解放军国防科技大学 Countermeasure sample detection method and device based on intra-class adjustment cosine similarity

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XU, Dongwei et al.: "A Survey of Speech Adversarial Attack and Defense Methods" (语音对抗攻击与防御方法综述), Journal of Cyber Security (信息安全学报), vol. 7, no. 1, 31 January 2022 (2022-01-31), pages 126-144 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117012204A (en) * 2023-07-25 2023-11-07 贵州师范大学 Defensive method for countermeasure sample of speaker recognition system
CN117012204B (en) * 2023-07-25 2024-04-09 贵州师范大学 Defensive method for countermeasure sample of speaker recognition system
CN117316187A (en) * 2023-11-30 2023-12-29 山东同其万疆科技创新有限公司 English teaching management system
CN117316187B (en) * 2023-11-30 2024-02-06 山东同其万疆科技创新有限公司 English teaching management system

Similar Documents

Publication Publication Date Title
Novoselov et al. STC anti-spoofing systems for the ASVspoof 2015 challenge
Fallah et al. A new online signature verification system based on combining Mellin transform, MFCC and neural network
Hidayat et al. Denoising speech for MFCC feature extraction using wavelet transformation in speech recognition system
Dennis et al. Temporal coding of local spectrogram features for robust sound recognition
Rajaratnam et al. Noise flooding for detecting audio adversarial examples against automatic speech recognition
CN109872720B (en) Re-recorded voice detection algorithm for different scene robustness based on convolutional neural network
WO2006024117A1 (en) Method for automatic speaker recognition
Rajaratnam et al. Isolated and ensemble audio preprocessing methods for detecting adversarial examples against automatic speech recognition
CN115188384A (en) Voiceprint recognition countermeasure sample defense method based on cosine similarity and voice denoising
Wu et al. Defense for black-box attacks on anti-spoofing models by self-supervised learning
Sun et al. Ai-synthesized voice detection using neural vocoder artifacts
CN111312259B (en) Voiceprint recognition method, system, mobile terminal and storage medium
CN116416997A (en) Intelligent voice fake attack detection method based on attention mechanism
Kim et al. Multifeature fusion-based earthquake event classification using transfer learning
Ahmad et al. Automatic detection of tree cutting in forests using acoustic properties
Wu et al. Adversarial sample detection for speaker verification by neural vocoders
Białobrzeski et al. Robust Bayesian and light neural networks for voice spoofing detection
CN114640518B (en) Personalized trigger back door attack method based on audio steganography
Maciejewski et al. Neural networks for vehicle recognition
Alegre et al. Evasion and obfuscation in automatic speaker verification
Raj et al. Reconstruction of damaged spectrographic features for robust speech recognition.
Sailor et al. Unsupervised Representation Learning Using Convolutional Restricted Boltzmann Machine for Spoof Speech Detection.
CN116229992A (en) Voice lie detection method, device, medium and equipment
Gala et al. Evaluating the effectiveness of attacks and defenses on machine learning through adversarial samples
Yu et al. A multi-spike approach for robust sound recognition

Legal Events

Date Code Title Description
PB01 - Publication
SE01 - Entry into force of request for substantive examination