
CN109872720B - Re-recorded voice detection algorithm for different scene robustness based on convolutional neural network - Google Patents

Re-recorded voice detection algorithm for different scene robustness based on convolutional neural network Download PDF

Info

Publication number
CN109872720B
CN109872720B CN201910085725.8A
Authority
CN
China
Prior art keywords
time
frequency
voice
pooling
dimension
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910085725.8A
Other languages
Chinese (zh)
Other versions
CN109872720A (en)
Inventor
王泳
赵雅珺
张梦鸽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Polytechnic Normal University
Original Assignee
Guangdong Polytechnic Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Polytechnic Normal University filed Critical Guangdong Polytechnic Normal University
Priority to CN201910085725.8A priority Critical patent/CN109872720B/en
Publication of CN109872720A publication Critical patent/CN109872720A/en
Application granted granted Critical
Publication of CN109872720B publication Critical patent/CN109872720B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a re-recorded voice detection algorithm for different scene robustness based on a convolutional neural network, and particularly relates to the field of voice detection algorithms. The invention adopts the time-frequency diagram as the data input form of the network, and compared with the direct input of voice data, the time-frequency diagram has relatively dense distribution on the characteristic information introduced by the re-recording equipment, thereby being more beneficial to the characteristic extraction of the neural network, accelerating the training and improving the precision.

Description

Re-recorded voice detection algorithm for different scene robustness based on convolutional neural network
Technical Field
The invention relates to the field of voice detection algorithms, in particular to a re-recorded voice detection algorithm for different scene robustness based on a convolutional neural network.
Background
Research has shown that fraudulent voices such as voice conversion (VC), speech synthesis (SS), and re-recorded voices can effectively deceive an automatic speaker verification (ASV) system, allowing an attacker to impersonate another person and log into the system; re-recorded voices in particular cause ASV systems to produce a higher false acceptance rate and seriously threaten social security. VC and SS require substantial voice information and characteristics of the target speaker, and the existing algorithms are not fully mature, so their implementation cost and difficulty are relatively high; by contrast, re-recorded voice can easily be obtained with cheap recording equipment and contains essentially all the characteristics of the target speaker's voice, so it poses a greater threat than VC and SS. For this reason, the detection of re-recorded voice deserves particular attention.
ASV (automatic speaker verification) systems are increasingly used in practice, for example in access control systems, telephone banking, and military applications. ASV systems are very vulnerable to fraudulent voice attacks because the speaker verification process does not require any face-to-face contact. Fraudulent speech produced with audio equipment can threaten ASV systems and compromise their security. In the last decade, audio digital products have not only proliferated in variety but have also become increasingly multifunctional and powerful. The same or similar effects can now be achieved using relatively inexpensive devices, such as personal computers equipped with audio processing software or PDAs with audio processing capabilities. For example, a high-quality, low-cost recording device such as a smartphone can create fraudulent speech that puts ASV systems at risk. Fraudulent speech includes replay attacks, voice conversion, speech synthesis, and so on. An attacker can use spoofed voice feature data to gain illegitimate access to a system and then steal the user's files and private data, causing irreparable losses. Among these attacks, replay is more threatening than voice conversion and speech synthesis. A replay attack plays back pre-recorded speech samples captured from the actual target speaker. Replay-based spoofing requires no technical processing of the speech, and the replayed speech shares identical spectral and high-level characteristics with the actual target speaker's voice, making it the easiest type of voice attack to mount. Synthesized and converted voices exhibit certain errors and deviations relative to the actual target speaker's voice and are not completely identical to it, so detecting replay attacks is considerably more difficult than detecting synthesized or converted speech.
Disclosure of Invention
In order to overcome the above defects in the prior art, embodiments of the present invention provide a re-recorded voice detection algorithm based on convolutional neural network for different scene robustness, and by using a time-frequency diagram as a data input form of the network in the present invention, compared with directly inputting voice data, the time-frequency diagram has relatively dense distribution for feature information introduced by a re-recording device, and is more beneficial to neural network feature extraction, thereby accelerating training, improving accuracy, and having very high accuracy for detection of re-recorded voices of different recording devices, recording environments, and recording distances.
In order to achieve the purpose, the invention provides the following technical scheme: a re-recorded voice detection algorithm based on a convolutional neural network for different scene robustness specifically comprises the following steps:
a. acquiring original voice by using a recording device, and obtaining re-recorded voice through DA/AD conversion;
b. the original voice generates distortion in the transformation process, and the distortion data of the original voice is calculated through a distortion model, wherein the expression of the distortion model is as follows:
y(t) = λx(t/α) + η(t),
where y(t) is the re-recorded speech, x(t) is the original speech, λ is the amplitude transformation factor, α is the time-axis linear scaling factor, and η is the superimposed noise;
the corresponding frequency domain variation expression:
Y(jω) = λX(jαω) + N(jω),
where Y(jω), X(jω) and N(jω) are respectively the frequency-domain representations of y(t), x(t) and η; these characteristics are very stable for a fixed recording device, i.e. λ and α are constants;
c. re-recording the voice to produce a voice time-frequency diagram by short-time Fourier transform;
d. inputting the voice time-frequency diagram into the algorithm model, which comprises seven layers, each comprising a convolution layer and a pooling layer; the output of each convolution layer passes through a linear rectification function, residual connections are added between layers, the final features are extracted through global pooling, and the detection result is predicted through a sigmoid.
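To make step d concrete, the following is a minimal PyTorch sketch of the seven-layer structure described above: 3×1 convolution over frequency, 1×2 pooling over time, ReLU activations, residual connections between layers, global pooling, and a sigmoid output. The channel count (16), the small linear head before the sigmoid, and ceil-mode pooling (needed to keep the 62-step time axis poolable through all seven layers) are illustrative assumptions, not values stated in the patent.

```python
import torch
import torch.nn as nn

class FreqConvBlock(nn.Module):
    """One layer: 3x1 convolution over frequency, ReLU, residual add,
    then 1x2 max pooling over time (ceil mode assumed so seven
    successive poolings survive the 62-step time axis)."""
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, kernel_size=(3, 1), padding=(1, 0))
        self.pool = nn.MaxPool2d(kernel_size=(1, 2), ceil_mode=True)

    def forward(self, x):
        return self.pool(x + torch.relu(self.conv(x)))

class RecaptureDetector(nn.Module):
    def __init__(self, ch=16, n_layers=7):
        super().__init__()
        self.stem = nn.Conv2d(1, ch, kernel_size=(3, 1), padding=(1, 0))
        self.blocks = nn.Sequential(*[FreqConvBlock(ch) for _ in range(n_layers)])
        self.head = nn.Linear(ch, 1)   # small classification head (assumed)

    def forward(self, x):              # x: (batch, 1, 64 freq bins, 62 frames)
        h = self.blocks(torch.relu(self.stem(x)))
        h = h.mean(dim=(2, 3))         # global pooling over frequency and time
        return torch.sigmoid(self.head(h))

model = RecaptureDetector()
prob = model(torch.randn(8, 1, 64, 62))   # probability of re-recorded speech
```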
In a preferred embodiment, when transforming the re-recorded speech, the short-time Fourier transform uses a 126-point Hamming (Hanning) window with a step size of 50, giving a time-frequency diagram of size 64 × 62.
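As an illustration of this embodiment, the following numpy sketch produces a 64 × 62 time-frequency map from a 0.2 s clip at 16 kHz (3200 samples) using a 126-point window and a step of 50. The Hann window and the log-magnitude scaling are assumptions, since the patent names the window ambiguously and does not state the magnitude scaling.

```python
import numpy as np

def tf_map(x, win_len=126, hop=50):
    win = np.hanning(win_len)                  # Hann window (assumed)
    n_frames = 1 + (len(x) - win_len) // hop   # 1 + (3200 - 126) // 50 = 62
    frames = np.stack([x[i*hop : i*hop + win_len] * win for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))  # 126-point rFFT -> 64 bins
    return np.log1p(mag).T                     # (64 freq, 62 time); log scaling assumed

x = np.random.randn(3200)     # placeholder for a 0.2 s, 16 kHz voice segment
print(tf_map(x).shape)        # (64, 62)
```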
In a preferred embodiment, the algorithm model convolves in the frequency dimension and pools in the time dimension; specifically, it uses a 3×1 convolution kernel with 1×2 pooling, which matches the feature distribution characteristics of the time-frequency diagram: the distribution exhibits independence between adjacent speech frames and consistency within a specific frequency band.
In a preferred embodiment, the algorithmic model employs deep learning as a data-driven technique.
In a preferred embodiment, the re-recording device introduces variations in the frequency domain of the original audio signal, and the deep learning model takes the time-frequency representation of the audio signal as the input data of the network.
In a preferred embodiment, the algorithm model does not consider the correlation of the time dimension when performing the convolution of the frequency dimension, and performs the pooling of the time dimension simultaneously when performing the convolution of the frequency dimension.
In a preferred embodiment, the convolution kernel shares parameters, so that identically distributed device feature information along the time dimension trains the convolution kernel parameters repeatedly; the pooling layer adopts 1×2 pooling in the time dimension, and no pooling is performed in the frequency dimension.
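The parameter saving from the 3×1 kernel can be checked directly; the sketch below compares it against a conventional 3×3 kernel at an assumed channel width of 16 (the patent does not state channel counts).

```python
import torch.nn as nn

freq_only = nn.Conv2d(16, 16, kernel_size=(3, 1))  # the patent's frequency-only kernel
square    = nn.Conv2d(16, 16, kernel_size=(3, 3))  # a conventional square kernel

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(freq_only), count(square))  # 784 vs 2320: roughly a 3x reduction
```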
The invention has the technical effects and advantages that:
1. the time-frequency graph is used as a data input form of the network, and compared with the mode of directly inputting voice data, the time-frequency graph has relatively dense distribution on feature information introduced by the re-recording equipment, so that the neural network feature extraction is facilitated, the training is accelerated, and the precision is improved;
2. the method adopts convolution in the frequency dimension and pooling in the time dimension, specifically a 3×1 convolution kernel with 1×2 pooling; convolving only in the frequency dimension, without considering time-dimension correlation, greatly reduces the number of convolution-kernel parameters, giving the model stronger resistance to overfitting and reducing excessive dependence on data volume; meanwhile, during training, because the convolution kernel shares parameters, identically distributed device feature information along the time dimension trains the kernel parameters repeatedly, making training more thorough;
3. the method does not need to manually select one or more specific features and then classify them with a classifier, as traditional machine learning methods do; it spontaneously extracts the relevant features, including shallow edge features and deep features, and then classifies them, simplifying the whole process and achieving a better effect;
4. the algorithm of the invention has high accuracy in detecting the re-recorded voice of different recording devices, recording environments and recording distances.
Drawings
FIG. 1 is a schematic diagram of an algorithm model structure according to the present invention.
Fig. 2 is a schematic diagram of a speech re-recording process according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Example 1
As shown in fig. 1, the re-recorded speech detection algorithm based on a convolutional neural network and robust to different scenes uses an algorithm model with 7 layers in total, each layer comprising a convolution layer and a pooling layer. The output of each convolution layer passes through a linear rectification function, residual connections are added between layers, the final features are extracted through global pooling, and the detection result is predicted through sigmoid. The model adopts frequency-dimension convolution and time-dimension pooling, specifically 3×1 convolution kernels and 1×2 pooling, which minimizes the model's capacity, greatly reduces the risk of overfitting, and lessens the model's dependence on data volume. This design is highly matched to the feature distribution characteristics of the time-frequency diagram, so that training parameters are allocated more reasonably and a more compact set of parameters is trained on more effective features;
the voice time-frequency diagram is generated by short-time Fourier transform, and compared with the voice data which is directly input, the time-frequency diagram has relatively dense distribution on the characteristic information introduced by the re-recording equipment, so that the neural network characteristic extraction is more facilitated, the training is accelerated, the precision is improved, the re-recording equipment can introduce variation on the frequency domain of the original voice signal, the performance of a deep learning model has extremely high dependence on the data, the original audio signal is used as the input data of the network, the characteristic distribution is too sparse, and the difficulty of extracting effective characteristics by the neural network is greatly improved;
example 2
As shown in fig. 2, in the re-recorded speech detection algorithm based on a convolutional neural network and robust to different scenes, re-recording causes a certain degree of distortion of the speech data, including amplitude distortion and linear stretching on the time axis, where the distortion model expression is:
y(t) = λx(t/α) + η(t),
where y(t) is the re-recorded speech, x(t) is the original speech, λ is the amplitude transformation factor, α is the time-axis linear scaling factor, and η is the superimposed noise;
the corresponding frequency domain variation expression:
Y(jω) = λX(jαω) + N(jω),
where Y(jω), X(jω) and N(jω) are respectively the frequency-domain representations of y(t), x(t) and η; these characteristics are very stable for a fixed recording device, i.e. λ and α are constants;
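A minimal numpy sketch of this distortion model follows, applying amplitude scaling λ, linear time-axis stretching by α, and additive noise η; the parameter values are illustrative, not taken from the patent.

```python
import numpy as np

def simulate_rerecord(x, lam=0.8, alpha=1.02, noise_std=0.005):
    """y(t) = lam * x(t/alpha) + eta(t): amplitude scaling, time stretching, noise."""
    n = np.arange(int(len(x) * alpha))
    stretched = np.interp(n / alpha, np.arange(len(x)), x)  # x(t/alpha) by interpolation
    return lam * stretched + noise_std * np.random.randn(len(stretched))

x = np.random.randn(3200)    # placeholder original speech segment
y = simulate_rerecord(x)     # simulated re-recorded version
```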
example 3
In this embodiment, 0.2-second voice segments are used as experimental data; a 126-point Hamming (Hanning) window is used for the short-time Fourier transform, the step size is 50, and the size of the time-frequency diagram is 64 × 62;
furthermore, in the above technical solution, convolution is performed only in the frequency dimension and pooling only in the time dimension. Because correlation in the time dimension is not considered, the number of convolution-kernel parameters is greatly reduced, giving the model stronger resistance to overfitting and reducing excessive dependence on data volume. Meanwhile, during training, because the convolution kernel shares parameters, identically distributed device feature information in the time dimension trains the kernel parameters repeatedly, making training more thorough. The pooling layer performs 1×2 pooling in the time dimension and none in the frequency dimension. Pooling reduces the feature dimensionality, accelerates network computation, and makes the network structure more robust to stretching and deformation of data features; for a time-frequency diagram, the feature distribution exhibits no such stretching or deformation, so pooling only in the time dimension reduces computation without losing frequency-dimension features. Finally, through the multi-layer convolution and pooling calculations, the features become one-dimensional, with length equal to the frequency dimension of the time-frequency diagram;
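The collapse of the time axis described above can be traced layer by layer; the short sketch below assumes ceil-mode pooling so that the seventh pooling remains well defined.

```python
import math

t = 62                     # time frames of the 64 x 62 time-frequency diagram
for layer in range(1, 8):
    t = math.ceil(t / 2)   # 1x2 time pooling per layer (ceil mode assumed)
    print(f"after layer {layer}: time = {t}")
# 31, 16, 8, 4, 2, 1, 1 -- the time axis collapses to 1 while the
# 64-bin frequency axis is untouched, matching the text above
```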
further, in the above technical scheme, the original voice library consists of 30,000 voice segments recorded by 60 speakers in total, with a sampling frequency of 16 kHz and a quantization precision of 16 bits;
randomly selecting the voices of 10 speakers as test data, and using the voices of the other 50 speakers for training, so that the independence of the training data and the test data is ensured, and the recording of the same speaker is prevented from appearing in different data sets;
the specific recording process is as follows: for the training set, the original speech library is re-recorded 4 times by different distances and equipment combinations under a quiet environment, so that 4 re-recorded speech libraries are obtained, each of the 4 re-recorded speech libraries comprises 25000 sections of speech, 25000 sections of speech in total are randomly extracted from the 4 speech libraries to serve as negative samples, and the negative samples and the original speech form 50000 sections of training data sets. Original voice is played through a portable computer association Y40-70 AT-IFI; the re-recording equipment is portable computer Dall Inspion Lingye 14 (Ins 14 VD-258) and smart phone millet 2S;
the 4 recordings are shown in table 1:
TABLE 1 recording of speech
[Table 1 is rendered as an image in the original document; the four device/distance recording configurations are not recoverable from the text.]
For the test data, the same recording settings as in Table 1 are adopted. To verify the model's robustness to speech disturbed by random environmental noise, voices are recorded both in a quiet environment and in an environment with a certain amount of random noise; the test set comprises 4 voice libraries in total, each containing 10,000 test voices across the quiet and noisy environments under its recording setup;
further, in the above technical solution, the network error function is a cross entropy loss function, training is performed by using Adam optimization algorithm, the initial learning rate is set to 0.001, the learning rate is dynamically adjusted in the training process, the learning rate is reduced by one time every 10000 times of training, the batch size of each training is 32, in order to supervise the training effect in the training process, 2000 pieces of data are randomly selected from the training data for verification, and by comparing the training data loss function with the verification data loss function, a regularization term is added to the loss function and a regularization coefficient is set to 0.0001, so that overfitting can be effectively prevented;
table 2 lists some important hyper-parameter settings in the training process, under which the network converges rapidly during the training process and finally achieves a rather high accuracy;
TABLE 2 Hyper-parameters (β1 and β2 are Adam optimizer parameters)
[Table 2 is rendered as an image in the original document; hyper-parameter values beyond those stated above are not recoverable from the text.]
Further, in the above technical solution, this embodiment includes 4 experimental tests, performed for different recording devices and different recording distances; the experimental result of each experiment is shown in Table 3:
TABLE 3 results of the experiment
[Table 3 is rendered as an image in the original document; the per-experiment accuracy figures are not recoverable from the text.]
The test accuracy under all conditions exceeds 99.8%, showing that the experimental model has good universality.
And finally: the above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that are within the spirit and principle of the present invention are intended to be included in the scope of the present invention.

Claims (5)

1. A re-recorded voice detection algorithm based on a convolutional neural network and robust to different scenes is characterized by comprising the following steps:
a. acquiring original voice by using a recording device, and obtaining re-recorded voice through DA/AD conversion;
b. the original voice can generate distortion in the transformation process, and the distortion data of the original voice is calculated through a distortion model, wherein the expression of the distortion model is as follows:
y(t) = λx(t/α) + η(t),
where y(t) is the re-recorded speech, x(t) is the original speech, λ is the amplitude transformation factor, α is the time-axis linear scaling factor, and η is the superimposed noise;
the corresponding frequency domain variation expression:
Y(jω)=λX(jαω)+N(jω),
where Y(jω), X(jω) and N(jω) are respectively the frequency-domain representations of y(t), x(t) and η; these characteristics are very stable for a fixed recording device, i.e. λ and α are constants;
c. re-recording the voice and producing a voice time-frequency graph by short-time Fourier transform;
d. inputting a voice time-frequency graph into an algorithm model, wherein the algorithm model comprises seven layers, each layer comprises a convolution layer and a pooling layer, the output of the convolution layer passes through a linear rectification function, residual connection is added between the layers, and finally, the final characteristics are extracted through global pooling, and the detection result is predicted through sigmoid;
the convolution layers of the algorithm model convolve only in the frequency dimension, without considering time-dimension correlation, and the pooling layers pool only in the time dimension, without pooling in the frequency dimension; the specific setting adopts a 3×1 convolution kernel and 1×2 pooling, which matches the feature distribution characteristics of the speech time-frequency diagram: the distribution exhibits independence between adjacent speech frames and consistency within a specific frequency band.
2. The convolutional neural network-based re-recorded speech detection algorithm robust to different scenarios as claimed in claim 1, wherein: when the re-recorded speech is transformed, a 126-point Hamming (Hanning) window is used for the short-time Fourier transform, the step size is 50, and the size of the time-frequency diagram is 64 × 62.
3. The convolutional neural network-based re-recorded speech detection algorithm robust to different scenarios as claimed in claim 1, wherein: the algorithmic model employs deep learning as a data-driven technique.
4. The convolutional neural network-based re-recorded speech detection algorithm robust to different scenarios as claimed in claim 1, wherein: the algorithm model does not consider the correlation of the time dimension when performing convolution in the frequency dimension, and performs pooling of the time dimension when performing convolution in the frequency dimension.
5. The convolutional neural network-based re-recorded speech detection algorithm robust to different scenarios as claimed in claim 1, wherein: the convolution kernel shares parameters, so that identically distributed device feature information along the time dimension trains the convolution kernel parameters repeatedly; the pooling layer adopts 1×2 pooling in the time dimension, and no pooling is performed in the frequency dimension.
CN201910085725.8A 2019-01-29 2019-01-29 Re-recorded voice detection algorithm for different scene robustness based on convolutional neural network Active CN109872720B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910085725.8A CN109872720B (en) 2019-01-29 2019-01-29 Re-recorded voice detection algorithm for different scene robustness based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910085725.8A CN109872720B (en) 2019-01-29 2019-01-29 Re-recorded voice detection algorithm for different scene robustness based on convolutional neural network

Publications (2)

Publication Number Publication Date
CN109872720A CN109872720A (en) 2019-06-11
CN109872720B true CN109872720B (en) 2022-11-22

Family

ID=66918246

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910085725.8A Active CN109872720B (en) 2019-01-29 2019-01-29 Re-recorded voice detection algorithm for different scene robustness based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN109872720B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110211604A (en) * 2019-06-17 2019-09-06 广东技术师范大学 A kind of depth residual error network structure for voice deformation detection
CN112614483B (en) * 2019-09-18 2024-07-16 珠海格力电器股份有限公司 Modeling method, voice recognition method and electronic equipment based on residual convolution network
CN110797031A (en) * 2019-09-19 2020-02-14 厦门快商通科技股份有限公司 Voice change detection method, system, mobile terminal and storage medium
CN110689902B (en) * 2019-12-11 2020-07-14 北京影谱科技股份有限公司 Audio signal time sequence processing method, device and system based on neural network and computer readable storage medium
CN111370028A (en) * 2020-02-17 2020-07-03 厦门快商通科技股份有限公司 Voice distortion detection method and system
CN111916067A (en) * 2020-07-27 2020-11-10 腾讯科技(深圳)有限公司 Training method and device of voice recognition model, electronic equipment and storage medium
CN112767951A (en) * 2021-01-22 2021-05-07 广东技术师范大学 Voice conversion visual detection method based on deep dense network

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10229700B2 (en) * 2015-09-24 2019-03-12 Google Llc Voice activity detection
US10224058B2 (en) * 2016-09-07 2019-03-05 Google Llc Enhanced multi-channel acoustic models
CN108198561A (en) * 2017-12-13 2018-06-22 宁波大学 A kind of pirate recordings speech detection method based on convolutional neural networks
CN109065030B (en) * 2018-08-01 2020-06-30 上海大学 Convolutional neural network-based environmental sound identification method and system

Also Published As

Publication number Publication date
CN109872720A (en) 2019-06-11

Similar Documents

Publication Publication Date Title
CN109872720B (en) Re-recorded voice detection algorithm for different scene robustness based on convolutional neural network
US10540988B2 (en) Method and apparatus for sound event detection robust to frequency change
CN108198561A (en) A kind of pirate recordings speech detection method based on convolutional neural networks
CN116416997A (en) Intelligent voice fake attack detection method based on attention mechanism
CN115605946A (en) System and method for multi-channel speech detection
Yan et al. Audio deepfake detection system with neural stitching for add 2022
Ozer et al. Lanczos kernel based spectrogram image features for sound classification
CN116488942B (en) Back door safety assessment method for intelligent voiceprint recognition system
Dixit et al. Review of audio deepfake detection techniques: Issues and prospects
Himawan et al. Deep domain adaptation for anti-spoofing in speaker verification systems
CN116469395A (en) Speaker recognition method based on Fca-Res2Net fusion self-attention
Müller et al. Complex-valued neural networks for voice anti-spoofing
CN115188384A (en) Voiceprint recognition countermeasure sample defense method based on cosine similarity and voice denoising
CN118212929A (en) Personalized Ambiosonic voice enhancement method
Choi et al. Light-weight Frequency Information Aware Neural Network Architecture for Voice Spoofing Detection
CN116386664A (en) Voice counterfeiting detection method, device, system and storage medium
Ibitoye et al. A GAN-based Approach for Mitigating Inference Attacks in Smart Home Environment
CN116110417A (en) Data enhancement method and device for ultrasonic voiceprint anti-counterfeiting
Shi et al. Anti-replay: A fast and lightweight voice replay attack detection system
Chang et al. Intra-utterance similarity preserving knowledge distillation for audio tagging
Akesbi Audio denoising for robust audio fingerprinting
Guo et al. Towards the universal defense for query-based audio adversarial attacks on speech recognition system
CN116631406B (en) Identity feature extraction method, equipment and storage medium based on acoustic feature generation
Santos et al. Audio Attacks and Defenses against AED Systems--A Practical Study
Kilinc et al. Audio Deepfake Detection by using Machine and Deep Learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 510665 293 Zhongshan Avenue, Tianhe District, Guangzhou, Guangdong.

Applicant after: GUANGDONG POLYTECHNIC NORMAL University

Address before: 510665 293 Zhongshan Avenue, Tianhe District, Guangzhou, Guangdong.

Applicant before: GUANGDONG POLYTECHNIC NORMAL University

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant