CN109872720B - Re-recorded voice detection algorithm for different scene robustness based on convolutional neural network - Google Patents
- Publication number
- CN109872720B CN109872720B CN201910085725.8A CN201910085725A CN109872720B CN 109872720 B CN109872720 B CN 109872720B CN 201910085725 A CN201910085725 A CN 201910085725A CN 109872720 B CN109872720 B CN 109872720B
- Authority
- CN
- China
- Prior art keywords
- time
- frequency
- voice
- pooling
- dimension
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Electrically Operated Instructional Devices (AREA)
Abstract
The invention discloses a convolutional-neural-network-based re-recorded voice detection algorithm that is robust to different scenes, and relates in particular to the field of voice detection algorithms. The invention adopts the time-frequency map as the data input form of the network. Compared with directly inputting voice data, the time-frequency map concentrates the characteristic information introduced by the re-recording device into a relatively dense distribution, which facilitates feature extraction by the neural network, accelerates training, and improves accuracy.
Description
Technical Field
The invention relates to the field of voice detection algorithms, and in particular to a re-recorded voice detection algorithm based on a convolutional neural network that is robust to different scenes.
Background
Research has proved that fraudulent voices such as those produced by Voice Conversion (VC) and Speech Synthesis (SS), as well as re-recorded voices, can effectively deceive an Automatic Speaker Verification (ASV) system and thereby impersonate others to log into the system; re-recorded voices in particular cause the ASV system to produce a higher false acceptance rate and seriously threaten social security. VC and SS require a large amount of voice information and characteristics of the target speaker, and the existing algorithms are not yet fully mature, so their cost and difficulty of realization are relatively high. Re-recorded voice, by contrast, can easily be obtained with a cheap recording device and contains essentially all the characteristics of the target speaker's voice, so it poses a greater threat than VC and SS. For this reason, the detection of re-recorded voice deserves particular attention.
ASV (automatic speaker verification) systems are increasingly used in practice, for example in access control systems, telephone banking, and the military. Because the speaker verification process does not require any face-to-face contact, ASV systems are very vulnerable to fraudulent voice attacks. Fraudulent speech produced with audio equipment can threaten ASV systems and compromise their security. In the last decade, audio digital products have not only multiplied in variety, but individual products have also become increasingly multifunctional and powerful. The same or similar effects can now be achieved with relatively inexpensive devices, such as a personal computer equipped with audio processing software or a PDA with audio processing capability. For example, a high-quality, low-cost recording device such as a smartphone can create fraudulent speech and thus put ASV systems at risk. Fraudulent speech includes replay attacks, voice conversion, speech synthesis, and the like. An attacker can use spoofed voice feature data to gain illegitimate access to the system, then steal the user's files and private data, causing irreparable loss. Among these attacks, replay is more threatening than voice conversion and speech synthesis. A replay attack presents continuous pre-recorded speech samples captured from the actual target speaker. Replay-based spoofing requires no technical processing of the speech, and the replayed speech shares essentially identical spectral and high-level characteristics with the actual target speaker's speech, making it the easiest type of voice attack to mount. Synthesized and converted voices, in contrast, exhibit certain errors and changes relative to the actual target speaker's voice and are not completely the same, so replay attacks are also more difficult to detect than synthesized or converted speech.
Disclosure of Invention
In order to overcome the above defects in the prior art, embodiments of the present invention provide a convolutional-neural-network-based re-recorded voice detection algorithm that is robust to different scenes. The invention uses the time-frequency map as the data input form of the network. Compared with directly inputting voice data, the feature information introduced by the re-recording device is distributed relatively densely in the time-frequency map, which facilitates neural network feature extraction, accelerates training, and improves accuracy; the algorithm achieves very high accuracy in detecting voice re-recorded with different recording devices, in different recording environments, and at different recording distances.
In order to achieve the purpose, the invention provides the following technical scheme: a re-recorded voice detection algorithm based on a convolutional neural network for different scene robustness specifically comprises the following steps:
a. acquiring original voice by using a recording device, and obtaining re-recorded voice through DA/AD conversion;
b. the original voice undergoes distortion during this transformation, and the distortion data of the original voice are calculated through a distortion model whose expression is:

y(t) = λx(t/α) + η(t),

where y(t) is the re-recorded speech, x(t) is the original speech, λ is the amplitude transformation factor, α is the time-axis linear scaling factor, and η is the superimposed noise;

the corresponding frequency-domain expression is:

Y(jω) = λX(jαω) + N(jω),

where Y(jω), X(jω) and N(jω) are the frequency-domain representations of y(t), x(t) and η respectively; for a fixed recording device these characteristics are very stable, i.e. λ and α are constants;
c. converting the re-recorded voice into a voice time-frequency map by short-time Fourier transform;
d. inputting the voice time-frequency map into the algorithm model. The model comprises seven layers, each containing a convolution layer and a pooling layer; the output of each convolution layer passes through a linear rectification (ReLU) function, residual connections are added between the layers, the final features are extracted through global pooling, and the detection result is predicted through a sigmoid.
In a preferred embodiment, when transforming the re-recorded speech, the short-time Fourier transform uses a 126-sample Hamming (Hanning) window with a step size of 50, giving a time-frequency map of size 64 × 62.
In a preferred embodiment, the algorithm model convolves along the frequency dimension and pools along the time dimension, specifically using 3x1 convolution kernels and 1x2 pooling. This matches the characteristic feature distribution of the time-frequency map, which is independent between adjacent speech frames but consistent within specific frequency bands.
In a preferred embodiment, the algorithmic model employs deep learning as a data-driven technique.
In a preferred embodiment, the re-recording device introduces variations in the frequency domain of the original audio signal; if the raw audio signal were taken directly as the input data of the network, these features would be too sparsely distributed, which is why the deep learning model takes the time-frequency map as its input instead.
In a preferred embodiment, the algorithm model does not consider the correlation of the time dimension when performing the convolution of the frequency dimension, and performs the pooling of the time dimension simultaneously when performing the convolution of the frequency dimension.
In a preferred embodiment, the convolution kernels use parameter sharing: identically distributed device feature information along the time dimension trains the kernel parameters repeatedly. The pooling layer applies 1x2 pooling in the time dimension; the frequency dimension is not pooled.
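To make the parameter saving of frequency-only kernels concrete, the following back-of-the-envelope sketch compares the per-layer weight count of a 3x1 kernel with that of an ordinary 3x3 kernel; the channel width of 32 is an illustrative assumption, not a value taken from the patent:

```python
# Weight count per convolution layer, ignoring biases. The channel
# widths (32 in / 32 out) are assumed example values.
c_in, c_out = 32, 32
p_3x1 = c_in * c_out * 3 * 1   # 3x1 frequency-only kernel
p_3x3 = c_in * c_out * 3 * 3   # full 3x3 kernel
print(p_3x1, p_3x3, p_3x3 / p_3x1)   # -> 3072 9216 3.0, a threefold reduction
```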
The invention has the technical effects and advantages that:
1. the time-frequency map is used as the data input form of the network; compared with directly inputting voice data, the time-frequency map concentrates the feature information introduced by the re-recording device into a relatively dense distribution, which facilitates neural network feature extraction, accelerates training, and improves accuracy;
2. the method convolves in the frequency dimension and pools in the time dimension, specifically with 3x1 convolution kernels and 1x2 pooling. Convolving only in the frequency dimension, without modeling time-dimension correlation, greatly reduces the number of convolution-kernel parameters, giving the model stronger resistance to overfitting and reducing its dependence on data volume. Meanwhile, during training, because the kernel parameters are shared, identically distributed device feature information along the time dimension trains the kernel parameters repeatedly, so training is more thorough;
3. unlike traditional machine learning methods, the method does not require manually selecting one or more specific features and then classifying with a classifier; it spontaneously extracts the relevant features, including shallow edge features and deep features, before classification, which simplifies the whole process and achieves a better result;
4. the algorithm of the invention detects with high accuracy voice re-recorded with different recording devices, in different recording environments, and at different recording distances.
Drawings
FIG. 1 is a schematic diagram of an algorithm model structure according to the present invention.
Fig. 2 is a schematic diagram of a speech re-recording process according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Example 1
As shown in fig. 1, the algorithm model of the convolutional-neural-network-based re-recorded speech detection algorithm robust to different scenes has 7 layers in total. Each layer comprises a convolution layer and a pooling layer; the output of each convolution layer passes through a linear rectification (ReLU) function, residual connections are added between the layers, the final features are extracted through global pooling, and the detection result is predicted through a sigmoid. Frequency-dimension convolution and time-dimension pooling are adopted, specifically 3x1 convolution kernels and 1x2 pooling. This minimizes the capacity of the model, greatly reducing the risk of overfitting and the model's dependence on data volume; it is also highly matched to the characteristic feature distribution of the time-frequency map, so the training parameters are allocated more reasonably and a more compact set of parameters is trained on more effective features;
the voice time-frequency diagram is generated by short-time Fourier transform, and compared with the voice data which is directly input, the time-frequency diagram has relatively dense distribution on the characteristic information introduced by the re-recording equipment, so that the neural network characteristic extraction is more facilitated, the training is accelerated, the precision is improved, the re-recording equipment can introduce variation on the frequency domain of the original voice signal, the performance of a deep learning model has extremely high dependence on the data, the original audio signal is used as the input data of the network, the characteristic distribution is too sparse, and the difficulty of extracting effective characteristics by the neural network is greatly improved;
example 2
As shown in fig. 2, in the convolutional-neural-network-based re-recorded speech detection algorithm robust to different scenes, re-recording causes a certain degree of distortion of the speech data, including amplitude distortion and linear stretching of the time axis. The distortion model expression is:

y(t) = λx(t/α) + η(t),

where y(t) is the re-recorded speech, x(t) is the original speech, λ is the amplitude transformation factor, α is the time-axis linear scaling factor, and η is the superimposed noise;

the corresponding frequency-domain expression is:

Y(jω) = λX(jαω) + N(jω),

where Y(jω), X(jω) and N(jω) are the frequency-domain representations of y(t), x(t) and η respectively; these characteristics are very stable for a fixed recording device, i.e. λ and α are constants;
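A minimal numerical sketch of this distortion model follows. The values chosen for λ, α and the noise level are illustrative only (the patent does not specify them), and linear interpolation is one simple way to realize the time-axis scaling:

```python
import numpy as np

def rerecord(x, fs, lam=0.9, alpha=1.001, noise_db=-40.0):
    """Simulate y(t) = lam * x(t/alpha) + eta: amplitude scaling,
    linear time-axis stretching, and additive noise."""
    t_out = np.arange(int(len(x) * alpha)) / fs            # stretched output time axis
    x_warp = np.interp(t_out / alpha, np.arange(len(x)) / fs, x)
    eta = 10 ** (noise_db / 20) * np.random.randn(len(x_warp))
    return lam * x_warp + eta

fs = 16000                                                 # 16 kHz, as in the patent
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)           # 1 s test tone standing in for speech
y = rerecord(x, fs)
print(x.shape, y.shape)                                    # the re-recording is slightly longer
```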
example 3
In this embodiment, 0.2-second voice segments are used as experimental data; the short-time Fourier transform uses a 126-sample Hamming (Hanning) window with a step size of 50, and the resulting time-frequency map has size 64 × 62;
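A sketch of this front end using scipy follows; the library choice and the log-magnitude scaling are my assumptions, since the patent fixes only the window length, step size, and output size:

```python
import numpy as np
from scipy.signal import stft

fs = 16000
x = np.random.randn(int(0.2 * fs))           # 3200 samples standing in for a speech clip

f, t, Z = stft(x, fs=fs,
               window='hann',                 # the patent says Hamming (Hanning); shape is unchanged
               nperseg=126,                   # 126-sample window
               noverlap=126 - 50,             # step (hop) of 50 samples
               boundary=None, padded=False)   # no edge padding, as implied by 64 x 62
tf_map = np.log(np.abs(Z) + 1e-10)            # log-magnitude time-frequency map
print(tf_map.shape)                           # -> (64, 62): 126//2+1 bins x 62 frames
```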
Furthermore, in the above technical solution, convolution is performed in the frequency dimension and pooling in the time dimension. Convolving only in the frequency dimension, without modeling time-dimension correlation, greatly reduces the number of convolution-kernel parameters, so the model resists overfitting more strongly and depends less heavily on data volume. During training, because the kernel parameters are shared, identically distributed device feature information along the time dimension trains the kernel parameters repeatedly, making training more thorough. The pooling layer applies 1x2 pooling in the time dimension and no pooling in the frequency dimension. Pooling reduces the dimensionality of the features, speeds up network computation, and makes the network structure more robust to stretching and deformation of the data features; for a time-frequency map, the feature distribution exhibits no such stretching or deformation, so pooling only in the time dimension reduces the feature size without losing frequency-dimension features. Finally, after the multi-layer convolution and pooling computation, the feature becomes one-dimensional, with its length equal to the frequency dimension of the time-frequency map;
Further, in the above technical solution, the original voice library consists of 30000 utterances recorded by 60 speakers in total, with a sampling frequency of 16 kHz and a quantization precision of 16 bits;
The voices of 10 speakers are randomly selected as test data, and the voices of the remaining 50 speakers are used for training. This guarantees the independence of the training and test data, preventing recordings of the same speaker from appearing in both data sets;
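A sketch of this speaker-disjoint split; the speaker IDs, random seed, and 500-utterances-per-speaker file layout are hypothetical placeholders:

```python
import random

speakers = [f"spk{i:02d}" for i in range(60)]
random.seed(0)
test_speakers = set(random.sample(speakers, 10))           # 10 held-out speakers
train_speakers = [s for s in speakers if s not in test_speakers]

# Hypothetical layout: each speaker contributes 500 wav files (60 x 500 = 30000).
utterances = {s: [f"{s}/utt{j:03d}.wav" for j in range(500)] for s in speakers}
train_files = [u for s in train_speakers for u in utterances[s]]
test_files = [u for s in test_speakers for u in utterances[s]]
print(len(train_files), len(test_files))                   # -> 25000 5000
```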
the specific recording process is as follows: for the training set, the original speech library is re-recorded 4 times by different distances and equipment combinations under a quiet environment, so that 4 re-recorded speech libraries are obtained, each of the 4 re-recorded speech libraries comprises 25000 sections of speech, 25000 sections of speech in total are randomly extracted from the 4 speech libraries to serve as negative samples, and the negative samples and the original speech form 50000 sections of training data sets. Original voice is played through a portable computer association Y40-70 AT-IFI; the re-recording equipment is portable computer Dall Inspion Lingye 14 (Ins 14 VD-258) and smart phone millet 2S;
the 4 recordings are shown in table 1:
TABLE 1 recording of speech
For the test data, the same recording configurations as in Table 1 are adopted. To verify the robustness of the model against random environmental noise interfering with the voice, recordings are made both in a quiet environment and in an environment with a certain amount of random noise. The test set thus comprises 4 speech libraries, each containing 10000 test utterances in total across the quiet and noisy conditions for its recording configuration;
Further, in the above technical solution, the network error function is the cross-entropy loss, and training uses the Adam optimization algorithm. The initial learning rate is set to 0.001 and is adjusted dynamically during training, being halved every 10000 training steps; the batch size of each training step is 32. To monitor the training effect, 2000 items are randomly drawn from the training data for validation; by comparing the training-data loss with the validation-data loss, and by adding a regularization term to the loss function with a regularization coefficient of 0.0001, overfitting can be effectively prevented;
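A sketch of this training configuration in PyTorch, reusing the Detector sketch from Example 1. Mapping the regularization term to Adam's weight_decay, the StepLR schedule, and the random stand-in batches are my assumptions:

```python
import torch
import torch.nn as nn

model = Detector()                       # the model sketch from Example 1
criterion = nn.BCELoss()                 # cross-entropy loss for sigmoid outputs
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             weight_decay=1e-4)   # regularization coefficient 0.0001
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10000, gamma=0.5)

# Random stand-in batches of 32 time-frequency maps with binary labels.
loader = [(torch.randn(32, 1, 64, 62), torch.randint(0, 2, (32, 1)).float())
          for _ in range(3)]

for tf_maps, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(tf_maps), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()                     # halves the learning rate every 10000 steps
```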
table 2 lists some important hyper-parameter settings in the training process, under which the network converges rapidly during the training process and finally achieves a rather high accuracy;
TABLE 2: Hyper-parameters (β1 and β2 are the Adam optimizer parameters)
Further, in the above technical solution, this embodiment includes 4 experimental tests, covering different recording devices and different recording distances; the result of each experiment is shown in Table 3:
TABLE 3: Experimental results
The test accuracy under the different conditions reaches more than 99.8%, showing that the experimental model generalizes well.
And finally: the above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that are within the spirit and principle of the present invention are intended to be included in the scope of the present invention.
Claims (5)
1. A re-recorded voice detection algorithm based on a convolutional neural network and robust to different scenes is characterized by comprising the following steps:
a. acquiring original voice by using a recording device, and obtaining re-recorded voice through DA/AD conversion;
b. the original voice undergoes distortion during this transformation, and the distortion data of the original voice are calculated through a distortion model whose expression is:

y(t) = λx(t/α) + η(t),

where y(t) is the re-recorded speech, x(t) is the original speech, λ is the amplitude transformation factor, α is the time-axis linear scaling factor, and η is the superimposed noise;
the corresponding frequency domain variation expression:
Y(jω)=λX(jαω)+N(jω),
where Y(jω), X(jω) and N(jω) are the frequency-domain representations of y(t), x(t) and η respectively; these characteristics are very stable for a fixed recording device, i.e. λ and α are constants;
c. converting the re-recorded voice into a voice time-frequency map by short-time Fourier transform;
d. inputting the voice time-frequency map into the algorithm model, wherein the algorithm model comprises seven layers, each comprising a convolution layer and a pooling layer; the output of each convolution layer passes through a linear rectification function, residual connections are added between the layers, the final features are extracted through global pooling, and the detection result is predicted through a sigmoid;
the convolution layers of the algorithm model convolve only in the frequency dimension, without considering the correlation of the time dimension, and the pooling layers pool only in the time dimension, with no pooling in the frequency dimension; the specific setting adopts 3x1 convolution kernels and 1x2 pooling, which matches the characteristic feature distribution of the speech time-frequency map, the distribution having independence between adjacent speech frames and consistency within specific frequency bands.
2. The convolutional neural network-based re-recorded speech detection algorithm robust to different scenarios as claimed in claim 1, wherein: when the re-recorded speech is transformed, the short-time Fourier transform uses a 126-sample Hamming (Hanning) window with a step size of 50, and the time-frequency map has size 64 × 62.
3. The convolutional neural network-based re-recorded speech detection algorithm robust to different scenarios as claimed in claim 1, wherein: the algorithmic model employs deep learning as a data-driven technique.
4. The convolutional neural network-based re-recorded speech detection algorithm robust to different scenarios as claimed in claim 1, wherein: the algorithm model does not consider the correlation of the time dimension when performing convolution in the frequency dimension, and performs pooling of the time dimension when performing convolution in the frequency dimension.
5. The convolutional neural network-based re-recorded speech detection algorithm robust to different scenarios as claimed in claim 1, wherein: the convolution kernels use parameter sharing, so that identically distributed device feature information along the time dimension trains the kernel parameters repeatedly; the pooling layer adopts 1x2 pooling in the time dimension, and the frequency dimension is not pooled.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910085725.8A CN109872720B (en) | 2019-01-29 | 2019-01-29 | Re-recorded voice detection algorithm for different scene robustness based on convolutional neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109872720A CN109872720A (en) | 2019-06-11 |
CN109872720B true CN109872720B (en) | 2022-11-22 |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| CB02 | Change of applicant information | Applicant: GUANGDONG POLYTECHNIC NORMAL University, 510665, 293 Zhongshan Avenue, Tianhe District, Guangzhou, Guangdong
| GR01 | Patent grant |