CN114722964B - Digital audio tampering passive detection method and device based on fusion of power grid frequency space and time sequence characteristics - Google Patents
Digital audio tampering passive detection method and device based on fusion of power grid frequency space and time sequence characteristics
- Publication number
- CN114722964B CN114722964B CN202210450835.1A CN202210450835A CN114722964B CN 114722964 B CN114722964 B CN 114722964B CN 202210450835 A CN202210450835 A CN 202210450835A CN 114722964 B CN114722964 B CN 114722964B
- Authority
- CN
- China
- Prior art keywords
- enf
- time sequence
- phase
- space
- frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y04—INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
- Y04S—SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
- Y04S40/00—Systems for electrical power generation, transmission, distribution or end-user application management characterised by the use of communication or information technologies, or communication or information technology specific aspects supporting them
- Y04S40/20—Information technology specific aspects, e.g. CAD, simulation, modelling, system security
Abstract
The invention relates to a digital audio tampering passive detection method and device based on the fusion of power grid frequency space and time sequence characteristics. The audio data to be detected are first processed to obtain the electric network frequency (ENF) component and, from it, the ENF phases φ_DFT⁰ and φ̂_DFT¹. The frame number and frame length of the space characterization and the time sequence characterization are determined according to the longest-duration audio to be detected, the frame shifts corresponding to the phases φ_DFT⁰ and φ̂_DFT¹ are calculated, and the phase sequences are framed accordingly. The framed φ_DFT⁰ data are reshaped to obtain an ENF space feature matrix, and the framed φ̂_DFT¹ data are split into two parts to form the ENF time sequence characterization. A neural network then acquires space information from the space feature matrix and ENF time sequence information from the ENF phase time sequence characterization, and the space and time sequence information are fused, fitted and classified. By fusing space and time sequence features, the invention describes the ENF changes in the audio more comprehensively and thereby improves detection accuracy.
Description
Technical Field
The invention belongs to the technical field of digital audio tamper detection, and particularly relates to a digital audio tamper passive detection method and device based on fusion of power grid frequency space and time sequence characteristics.
Background
With the rapid progress of digital audio technology, digital audio signals can be collected conveniently, but they can also be edited and modified just as easily afterwards with a variety of audio processing software. If intentionally or unintentionally tampered digital audio is used on important occasions such as judicial evidence collection, serious social problems may result, so research on digital audio tampering detection is of great significance.
Passive detection of digital audio tampering is a technology that analyses and judges the authenticity and integrity of digital audio using only the characteristics of the audio itself, without any added information, which makes it practical for complex forensic environments. When the recording device is powered by the power grid, an electric network frequency (ENF) signal remains in the recorded audio file. When the digital audio is tampered with, the ENF signal changes along with the tampering operation. The uniqueness and stability of the ENF signal therefore suggest two research directions for passive tampering detection: first, comparing the ENF signal extracted from the audio with an ENF database maintained by the power supply authority, which is difficult and costly to implement; second, extracting certain features of the ENF signal and analysing their consistency and regularity. Current research on audio tampering forensics with ENF signals mainly applies traditional machine learning to classify features of the ENF signal, such as phase change, phase discontinuity and instantaneous frequency mutation, in order to detect tampering.
Existing digital audio detection methods either set thresholds on the corresponding features or classify them with machine learning methods. These methods often rely too heavily on empirical settings, or are so tailored to one specific tampering method that their discrimination ability for other methods is insufficient.
In recent years, with improvements in machine learning algorithms and in the storage and computing power of computers, deep neural networks (DNN) have been applied to the field of audio tamper detection. Deep nonlinear transformations in a DNN fit the audio tampering characteristics better, enabling automatic learning and detection with a high recognition rate. However, existing methods use only a single kind of feature information and cannot fully exploit the power grid frequency information. The invention therefore provides an audio tampering detection method based on the fusion of power grid frequency space and time sequence characteristics: an ENF phase feature matrix serves as the spatial feature, from which a convolutional neural network acquires ENF space information; the ENF phase time sequence characterization serves as the temporal feature, from which a Bi-LSTM network acquires ENF time sequence information; the space and time sequence information are then fused through an attention mechanism, and finally a DNN classifier separates real audio from tampered audio.
Disclosure of Invention
The technical problems of the invention are mainly solved by the following technical solutions:
a digital audio tampering passive detection method based on fusion of power grid frequency space and time sequence features, characterized by comprising the following steps:
processing the audio data to be detected to obtain the electric network frequency (ENF) component, and processing the ENF component based on the DFT⁰ and DFT¹ transforms to obtain the ENF phases φ_DFT⁰ and φ̂_DFT¹;
determining the frame number and frame length of the space characterization and the time sequence characterization according to the longest-duration audio to be detected, calculating the frame shifts corresponding to the ENF phases φ_DFT⁰ and φ̂_DFT¹ respectively, and framing accordingly; reshaping the framing data obtained from the ENF phase φ_DFT⁰ to obtain an ENF space feature matrix, and splitting the framing data obtained from the ENF phase φ̂_DFT¹ into two parts to form the ENF time sequence characterization;
acquiring space information from the space feature matrix by using a neural network, acquiring ENF time sequence information from the ENF phase time sequence characterization, and fusing, fitting and classifying the space and time sequence information.
In the above method for passive detection of digital audio tampering based on fusion of power grid frequency space and time sequence features, processing the original voice signal to obtain the electric network frequency (ENF) component specifically comprises:
downsampling, with the signal resampling frequency set to 1000 Hz or 1200 Hz;
narrow-band filtering with a 10000-order linear zero-phase FIR filter, with the center frequency at the ENF standard, a bandwidth of 0.6 Hz, a passband ripple of 0.5 dB and a stopband attenuation of 100 dB.
In the above method for passive detection of digital audio tampering based on fusion of power grid frequency space and time sequence features, acquiring the ENF phase comprises:
Step 2.1, calculating the approximate first derivative of the ENF signal X_ENFC[n] at point n:
X′_ENFC[n] = f_d(X_ENFC[n] − X_ENFC[n−1])   (1)
where f_d is the resampling frequency, so that the scaled first difference approximates the derivative, and X_ENFC[n] is the value of the n-th point of the ENF component;
Step 2.2, framing and windowing X_ENFC[n] and X′_ENFC[n], with a frame length of 10 standard ENF frequency periods and a frame shift of 1 standard ENF frequency period, windowing X_ENFC[n] and X′_ENFC[n] with a Hanning window w(n):
X_N[n] = X_ENFC[n]w(n)   (2)
X′_N[n] = X′_ENFC[n]w(n)   (3)
where the Hanning window is w(n) = 0.5(1 − cos(2πn/(L−1))), 0 ≤ n ≤ L−1, and L is the window length;
Step 2.3, performing an N_DFT-point discrete Fourier transform (DFT) on each frame of the signals X_N[n] and X′_N[n] to obtain X(k) and X′(k);
Step 2.4, letting k_peak be the index of the peak of |X(k)|, from which the estimated frequency f_DFT = k_peak·f_d/N_DFT is obtained;
Step 2.5, from the estimated frequency f_DFT of the ENF signal, the ENF phase characteristic φ_DFT⁰ = arg[X(k_peak)] can be found;
Step 2.6, re-evaluating the DFT¹-transformed ENF phase φ̂_DFT¹: let k_peak be the index of the peak of |X′(k)|, and multiply |X′(k)| by a scale factor F(k), obtaining DFT⁰[k] = X(k) and DFT¹[k] = F(k)|X′(k)|; the estimated frequency value is then f_DFT¹ = DFT¹[k_peak]/(2π|DFT⁰[k_peak]|);
Step 2.7, k_peak should be the integer nearest to f_DFT¹·N_DFT/f_d (f_d is the resampling frequency), so that f_DFT¹ is a reasonable frequency value; φ̂_DFT¹ can then be represented in terms of the angle θ at the point f_DFT¹·N_DFT/f_d, which is obtained by linear interpolation from X′(k) between k_low = floor[f_DFT¹·N_DFT/f_d] and k_high = ceil[f_DFT¹·N_DFT/f_d], where floor[a] denotes the largest integer not exceeding a and ceil[b] the smallest integer not less than b;
since θ_low = arg[X′(k_low)] and θ_high = arg[X′(k_high)], linear interpolation between (k_low, θ_low) and (k_high, θ_high) approximates the value of θ at that point, consistent with the value of θ above;
Step 2.8, the φ̂_DFT¹ found this way has two possible values; using φ_DFT⁰ as reference, the value closest to φ_DFT⁰ is selected as the final value of φ̂_DFT¹.
In the above method for passive detection of digital audio tampering based on fusion of power grid frequency space and time sequence features, calculating the ENF space feature matrix specifically comprises:
Step 3.1, acquiring the audio data with the longest duration among the audio data to be detected;
Step 3.2, for the longest-duration audio, obtaining the phase sequence φ by DFT transformation;
Step 3.3, calculating the length of the longest phase sequence, length(φ);
Step 3.4, calculating the frame length m from the longest phase sequence, where m is the frame length of the frequency feature matrix, chosen so that the framed data can be reshaped into a square matrix;
Step 3.5, calculating the phase sequences of all the audio data;
Step 3.6, calculating the frame shift (from m, the frame number n and length(φ)) and framing;
Step 3.7, reshaping the framed phase and frequency data to obtain the feature matrix P_{n×n}.
In the above method for passive detection of digital audio tampering based on fusion of power grid frequency space and time sequence features, the specific method for calculating the ENF time sequence characterization comprises:
Step 4.1, acquiring the longest-duration audio data among the audio data to be detected;
Step 4.2, for the longest-duration audio, obtaining the phase sequence φ̂ by DFT¹ transformation;
Step 4.3, setting the frame length m and calculating the number of frames n from the length of the longest phase sequence;
Step 4.4, for all audio data, calculating the frame shift overlap = m − floor(length(φ̂)/n);
Step 4.5, since the sequence may not divide evenly, splitting the frames into two parts: the first k frames keep the full frame length, while the remaining frames are one sample shorter, where k = length(φ̂) − (m − overlap)×n;
Step 4.6, arranging the framed phase data frame by frame to form the ENF phase time sequence characterization.
In the above method for passive detection of digital audio tampering based on fusion of power grid frequency space and time sequence features, the network model part comprises:
Step 5.1, acquiring spatial information through a convolutional neural network: the feature matrix P_{n×n} is processed with two convolution blocks to obtain the ENF space information, each convolution block consisting of two identical convolution layers and one pooling layer (the numbers of convolution kernels of the two convolution blocks are 32 and 64, the kernel size is 3×3, the stride is 1, and the MaxPooling pool size is 3); the last pooling layer outputs the ENF space information;
Step 5.2, acquiring time sequence information through a Bi-LSTM network: the ENF phase time sequence characterization is trained with two bidirectional long short-term memory (Bi-LSTM) modules, and the state of each time step is output; each Bi-LSTM module comprises a bidirectional LSTM layer, a LayerNormalization layer and a LeakyReLU activation function;
Step 5.3, concatenating the space and time sequence features to obtain a feature vector of length L;
Step 5.4, passing the vector through three fully connected layers, all with ReLU activation but with L, L/8 and L neurons respectively; the purpose is to compress the features, following the SE (Squeeze-and-Excitation) network, since the compression increases the nonlinearity of the network and yields more accurate weights; after this nonlinear operation, the weights of the space and time sequence features are obtained through one fully connected layer with L neurons and Sigmoid activation;
Step 5.5, multiplying the obtained weights by the space and time sequence features from before the attention mechanism to weight them; through automatic learning, different weights can be given to the space and time sequence features, and features with a large influence on the classification result receive larger weights, improving detection accuracy;
Step 5.6, fitting and classifying the fused space and time sequence features: the features are fully fitted with two fully connected layers (1024 and 256 neurons respectively, ReLU activation); a Dropout layer (dropout rate 0.2) is added between the two fully connected layers to prevent overfitting; finally, a fully connected layer (2 neurons, Softmax activation) serves as the output layer;
Step 5.7, the probability obtained from the output layer indicates whether the speech under test has been tampered with; the proportion of all test speech whose tampering status is correctly recognized is the recognition rate of the system.
A digital audio tampering passive detection device based on fusion of power grid frequency space and time sequence characteristics, characterized by comprising:
a first module, configured to process the audio data to be detected to obtain the electric network frequency (ENF) component and to process the ENF component based on the DFT⁰ and DFT¹ transforms to obtain the ENF phases φ_DFT⁰ and φ̂_DFT¹;
a second module, configured to calculate the frame shifts corresponding to the ENF phases φ_DFT⁰ and φ̂_DFT¹ respectively and to frame them accordingly, to reshape the framing data obtained from the ENF phase φ_DFT⁰ into an ENF space feature matrix, and to split the framing data obtained from the ENF phase φ̂_DFT¹ into two parts to form the ENF time sequence characterization;
a third module, configured to acquire space information from the space feature matrix using a neural network, to acquire ENF time sequence information from the ENF phase time sequence characterization, and to fuse, fit and classify the space and time sequence information.
Therefore, the invention has the following advantages. The invention provides a deep learning method that classifies by fusing ENF space and time sequence features. Where traditional methods express features insufficiently and do not make full use of ENF time sequence information, the fusion of space and time sequence features overcomes the single-feature limitation of traditional audio tampering detection, describes the ENF changes in the audio more comprehensively, and improves detection accuracy. An attention mechanism fuses the features, and automatic learning extracts the useful information from the audio ENF for the tampering detection classification task, reducing the interference of invalid information in the final result. Compared with traditional digital audio tamper detection methods, the invention can effectively improve the recognition performance of the system, strengthen the generalization capability of the model, optimize the system structure, and improve the competitiveness of corresponding device-source identification products.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
Fig. 2 is a structural diagram of a neural network.
Detailed Description
The technical scheme of the invention is further specifically described below through examples and with reference to the accompanying drawings.
Examples:
The invention relates to a digital audio tampering passive detection method based on fusion of power grid frequency space and time sequence characteristics. The algorithm flow, shown in FIG. 1, can be divided into five parts: 1) acquiring the ENF component; 2) extracting the ENF phase features; 3) acquiring the ENF space feature matrix; 4) acquiring the ENF time sequence characterization; 5) training the neural network.
Step one: the ENF component is obtained by the following steps:
A. downsampling the audio with a resampling frequency of 1000 Hz or 1200 Hz;
B. narrow-band filtering with a 10000-order linear zero-phase FIR filter, with the center frequency at the ENF standard (50 Hz or 60 Hz), a bandwidth of 0.6 Hz, a passband ripple of 0.5 dB and a stopband attenuation of 100 dB;
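The two steps above can be sketched in plain numpy. The naive linear-interpolation resampler, the 801-tap filter length and the windowed-sinc kernel below are illustrative assumptions for a compact example, not the patent's 10000-order design:

```python
import numpy as np

def extract_enf_component(audio, fs, fd=1000.0, f_enf=50.0,
                          bandwidth=0.6, numtaps=801):
    """Sketch of ENF-component extraction: resample to fd, then
    narrow-band filter around the nominal grid frequency f_enf."""
    # Naive resampling by linear interpolation (a production system would
    # use a proper polyphase resampler with anti-aliasing filtering).
    n_out = int(len(audio) * fd / fs)
    t_out = np.arange(n_out) / fd
    t_in = np.arange(len(audio)) / fs
    x = np.interp(t_out, t_in, audio)

    # Windowed-sinc band-pass FIR centred on f_enf with the given bandwidth.
    # A symmetric FIR has exactly linear phase; convolving with mode='same'
    # keeps the output aligned with the input.
    m = np.arange(numtaps) - (numtaps - 1) / 2
    f_lo = (f_enf - bandwidth / 2) / fd
    f_hi = (f_enf + bandwidth / 2) / fd
    h = 2 * f_hi * np.sinc(2 * f_hi * m) - 2 * f_lo * np.sinc(2 * f_lo * m)
    h *= np.hanning(numtaps)
    return np.convolve(x, h, mode='same')
```

A short filter cannot reach the 0.6 Hz transition of the patent's filter, but it already attenuates out-of-band components far from 50 Hz very strongly.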
step two: the ENF phase feature extraction comprises the following steps:
A. Calculating the first derivative of the signal, framing and windowing, applying the discrete Fourier transform, estimating the phase via linear interpolation, and computing the phase fluctuation characteristics:
(A-1) calculating the approximate first derivative of the ENF signal X_ENFC[n] at point n:
X′_ENFC[n] = f_d(X_ENFC[n] − X_ENFC[n−1])   (1)
where f_d is the resampling frequency, so that the scaled first difference approximates the derivative, and X_ENFC[n] is the value of the n-th point of the ENF component.
(A-2) framing and windowing X_ENFC[n] and X′_ENFC[n], with a frame length of 10 standard ENF frequency periods and a frame shift of 1 standard ENF frequency period; windowing X_ENFC[n] and X′_ENFC[n] with a Hanning window w(n):
X_N[n] = X_ENFC[n]w(n)   (2)
X′_N[n] = X′_ENFC[n]w(n)   (3)
where the Hanning window is w(n) = 0.5(1 − cos(2πn/(L−1))), 0 ≤ n ≤ L−1, and L is the window length.
(A-3) performing an N_DFT-point discrete Fourier transform (DFT) on each frame of the signals X_N[n] and X′_N[n] to obtain X(k) and X′(k) respectively.
(A-4) letting k_peak be the index of the peak of |X(k)|, from which the estimated frequency f_DFT = k_peak·f_d/N_DFT is obtained.
(A-5) from the estimated frequency f_DFT of the ENF signal, the ENF phase characteristic φ_DFT⁰ = arg[X(k_peak)] can be found.
(A-6) re-evaluating the DFT¹-transformed ENF phase φ̂_DFT¹: let k_peak be the index of the peak of |X′(k)|, and multiply |X′(k)| by a scale factor F(k), obtaining DFT⁰[k] = X(k) and DFT¹[k] = F(k)|X′(k)|; the estimated frequency value is then f_DFT¹ = DFT¹[k_peak]/(2π|DFT⁰[k_peak]|).
(A-7) k_peak should be the integer closest to f_DFT¹·N_DFT/f_d (f_d is the resampling frequency), so that f_DFT¹ is a reasonable frequency value; φ̂_DFT¹ can then be represented in terms of the angle θ at the point f_DFT¹·N_DFT/f_d, obtained by linear interpolation from X′(k) between k_low = floor[f_DFT¹·N_DFT/f_d] and k_high = ceil[f_DFT¹·N_DFT/f_d], where floor[a] denotes the largest integer not exceeding a and ceil[b] the smallest integer not less than b.
Since θ_low = arg[X′(k_low)] and θ_high = arg[X′(k_high)], linear interpolation between (k_low, θ_low) and (k_high, θ_high) approximates the value of θ at that point, consistent with the value of θ above.
(A-8) the φ̂_DFT¹ obtained this way has two possible values; using φ_DFT⁰ as reference, the value closest to φ_DFT⁰ is selected as the final value of φ̂_DFT¹.
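A simplified single-frame sketch of steps (A-1) to (A-8) follows. It omits the linear interpolation of step (A-7) and evaluates the candidate selection of step (A-8) directly at the spectral peak; tolerances and frame sizes are illustrative assumptions:

```python
import numpy as np

def estimate_enf_phase(frame, fd):
    """Sketch of the DFT0/DFT1 phase estimation for one frame."""
    x = np.asarray(frame, float)
    # (A-1): approximate first derivative, scaled by the sampling rate fd.
    # Note the first difference introduces a half-sample phase lag.
    dx = np.empty_like(x)
    dx[1:] = fd * (x[1:] - x[:-1])
    dx[0] = dx[1]
    # (A-2)/(A-3): Hanning window and DFT of signal and derivative.
    w = np.hanning(len(x))
    X = np.fft.rfft(x * w)
    Xd = np.fft.rfft(dx * w)
    # (A-4)/(A-5): spectral peak and DFT0 phase estimate.
    k = np.argmax(np.abs(X))
    phi0 = np.angle(X[k])
    # (A-8): differentiation shifts a sinusoid's phase by +pi/2, giving
    # two candidate DFT1 values; pick the one closest to phi0 (mod 2*pi).
    theta = np.angle(Xd[k])
    cands = np.array([theta - np.pi / 2, theta + np.pi / 2])
    diff = np.angle(np.exp(1j * (cands - phi0)))
    return phi0, cands[np.argmin(np.abs(diff))]
```

With a 50 Hz tone sampled at 1000 Hz and a 200-sample frame (10 ENF cycles), the DFT⁰ estimate is nearly exact, while the DFT¹ candidate carries the small half-sample bias of the first difference.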
Step three: the ENF space feature matrix is acquired by the following steps:
A. Calculating the ENF space feature matrix:
(A-1) acquiring the audio data with the longest duration among the audio data to be detected.
(A-2) for the longest-duration audio, obtaining the phase sequence φ by DFT transformation.
(A-3) calculating the length of the longest phase sequence, length(φ).
(A-4) calculating the frame length m from the longest phase sequence, where m is the frame length of the frequency feature matrix, chosen so that the framed data can be reshaped into a square matrix.
(A-5) calculating the phase sequences of all the audio data.
(A-6) calculating the frame shift (from m, the frame number n and length(φ)) and framing.
(A-7) reshaping the framed phase and frequency data to obtain the feature matrix P_{n×n}.
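The framing-and-reshape above can be sketched as follows, under two assumptions not stated explicitly in the steps: the frame length equals the number of frames (so the stacked frames form a square matrix), and sequences that do not divide evenly are edge-padded:

```python
import numpy as np

def phase_to_square_matrix(phase, n):
    """Sketch: frame a phase sequence into n overlapping frames of
    length n and stack them into an n-by-n ENF space feature matrix."""
    phase = np.asarray(phase, float)
    m = n                                 # square matrix: frame length == frame count
    hop = max(1, len(phase) // n)         # frame shift derived from total length
    need = (n - 1) * hop + m              # samples required for n full frames
    if len(phase) < need:                 # pad by repeating the last value
        phase = np.pad(phase, (0, need - len(phase)), mode='edge')
    rows = [phase[i * hop: i * hop + m] for i in range(n)]
    return np.stack(rows)                 # shape (n, n)
```

Because the frame shift is derived from the sequence length, audio of different durations still produces a fixed n×n matrix, which is what lets one network input size serve all recordings.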
Step four: the ENF phase time sequence characterization is acquired by the following steps:
A. Calculating the ENF phase time sequence characterization:
(A-1) acquiring the longest-duration audio data among the audio data to be detected.
(A-2) for the longest-duration audio, obtaining the phase sequence φ̂ by DFT¹ transformation.
(A-3) setting the frame length m and calculating the number of frames n from the length of the longest phase sequence.
(A-4) for all audio data, calculating the frame shift overlap = m − floor(length(φ̂)/n).
(A-5) since the sequence may not divide evenly, splitting the frames into two parts: the first k frames keep the full frame length, while the remaining frames are one sample shorter, where k = length(φ̂) − (m − overlap)×n.
(A-6) arranging the framed phase data frame by frame to form the ENF phase time sequence characterization.
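The two-part framing of steps (A-4) and (A-5) can be sketched numerically; the exact boundary handling of the two frame groups is an interpretation of the split described above, not a verbatim reproduction of the patent's implementation:

```python
import numpy as np

def frame_timing(phase, m, n):
    """Sketch of the two-part framing for the ENF time sequence
    characterization: n frames of nominal length m; the first k frames
    keep length m and the remaining frames are one sample shorter."""
    phase = np.asarray(phase, float)
    overlap = m - len(phase) // n      # overlap = m - floor(length/n)
    hop = m - overlap                  # == floor(length/n)
    k = len(phase) - hop * n           # number of full-length frames
    frames, pos = [], 0
    for i in range(n):
        size = m if i < k else m - 1
        frames.append(phase[pos: pos + size])
        pos += hop
    return frames, k
```

For a 103-sample phase sequence with m = 13 and n = 10, the hop is 10 and k = 3, so the first three frames have 13 samples and the remaining seven have 12.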
Step five: the network model comprises the following steps:
A. Spatial information is obtained through a convolutional neural network. The feature matrix P_{n×n} is processed with two convolution blocks to obtain the ENF space information; each convolution block consists of two identical convolution layers and one pooling layer (the numbers of convolution kernels of the two blocks are 32 and 64, the kernel size is 3×3, the stride is 1, and the MaxPooling pool size is 3). The last pooling layer outputs the ENF space information.
B. Time sequence information is acquired through the Bi-LSTM network. Two bidirectional long short-term memory (Bi-LSTM) modules train the ENF phase time sequence characterization and output the state of each time step. Each Bi-LSTM module comprises a bidirectional LSTM layer, a LayerNormalization layer and a LeakyReLU activation function.
C. The attention mechanism fuses the space and time sequence information.
(C-1) The space and time sequence features are concatenated to obtain a feature vector of length L.
(C-2) The vector is then passed through three fully connected layers, all with ReLU activation but with L, L/8 and L neurons respectively. The purpose is to compress the features, following the SE (Squeeze-and-Excitation) network: the compression increases the nonlinearity of the network and yields more accurate weights. After this nonlinear operation, the weights of the space and time sequence features are obtained through one fully connected layer with L neurons and Sigmoid activation.
(C-3) The obtained weights are multiplied by the space and time sequence features from before the attention mechanism. Through automatic learning, different weights can be assigned to the space and time sequence features, and features with a large influence on the classification result receive larger weights, improving detection accuracy.
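The SE-style weighting of (C-1) to (C-3) can be sketched in plain numpy. The random weight matrices stand in for the trained fully connected layers and are purely illustrative; in the patent these weights are learned end to end:

```python
import numpy as np

def se_attention_fuse(spatial, timing, seed=0):
    """Numpy sketch of the SE-style attention fusion: concatenate the
    spatial and timing feature vectors (length L), pass them through
    dense layers sized L -> L/8 -> L with ReLU, then a Sigmoid dense
    layer producing per-feature weights that rescale the fused vector."""
    f = np.concatenate([spatial, timing])            # fused vector, length L
    L = f.size
    rng = np.random.default_rng(seed)
    relu = lambda z: np.maximum(z, 0.0)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = relu(rng.standard_normal((L, L)) @ f * 0.1)        # dense, L neurons
    h = relu(rng.standard_normal((L // 8, L)) @ h * 0.1)   # squeeze to L/8
    h = relu(rng.standard_normal((L, L // 8)) @ h * 0.1)   # expand back to L
    w = sigmoid(rng.standard_normal((L, L)) @ h * 0.1)     # weights in (0, 1)
    return w * f                                           # reweighted features
```

Because the Sigmoid keeps every weight strictly between 0 and 1, the mechanism can only attenuate features, which is how low-influence information is suppressed before fitting and classification.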
D. The fused space and time sequence features are fitted and classified. The features are fully fitted with two fully connected layers (1024 and 256 neurons respectively, ReLU activation). A Dropout layer (dropout rate 0.2) is added between the two fully connected layers to prevent overfitting. Finally, a fully connected layer (2 neurons, Softmax activation) serves as the output layer.
E. The probability obtained from the output layer indicates whether the speech under test has been tampered with; the proportion of all test speech whose tampering status is correctly recognized is the recognition rate of the system.
This embodiment also relates to a digital audio tampering passive detection device based on the fusion of power grid frequency space and time sequence characteristics, comprising:
a first module, configured to process the audio data to be detected to obtain the electric network frequency (ENF) component and to process the ENF component based on the DFT⁰ and DFT¹ transforms to obtain the ENF phases φ_DFT⁰ and φ̂_DFT¹;
a second module, configured to calculate the frame shifts corresponding to the ENF phases φ_DFT⁰ and φ̂_DFT¹ respectively and to frame them accordingly, to reshape the framing data obtained from the ENF phase φ_DFT⁰ into an ENF space feature matrix, and to split the framing data obtained from the ENF phase φ̂_DFT¹ into two parts to form the ENF time sequence characterization;
a third module, configured to acquire space information from the space feature matrix using a neural network, to acquire ENF time sequence information from the ENF phase time sequence characterization, and to fuse, fit and classify the space and time sequence information.
The specific implementation steps of the three modules adopt the steps from the first step to the fifth step, and are not repeated here.
The specific embodiments described herein are offered by way of example only to illustrate the spirit of the invention. Those skilled in the art may make various modifications or additions to the described embodiments or substitutions thereof without departing from the spirit of the invention or exceeding the scope of the invention as defined in the accompanying claims.
Claims (3)
1. A digital audio tampering passive detection method based on fusion of power grid frequency space and time sequence features, characterized by comprising the following steps:
processing the audio data to be detected to obtain the electric network frequency (ENF) component, and processing the ENF component based on the DFT¹ transform to obtain the ENF phases φ⁰ and φ¹;
determining the frame number and frame length of the spatial characterization and the temporal characterization according to the longest-duration audio to be detected, calculating the frame shifts corresponding to the ENF phases φ⁰ and φ¹ respectively, and framing the phases accordingly; the framed data obtained from the phase φ⁰ are reshaped into an ENF spatial feature matrix, and the framed data obtained from the phase φ¹ are split into two parts to form the ENF temporal characterization;
acquiring spatial information from the spatial feature matrix with a neural network, acquiring ENF temporal information from the ENF phase temporal characterization, and fusing, fitting and classifying the spatial and temporal information;
Acquiring the ENF phase includes:
Step 2.1: calculate the approximate first derivative of the ENF signal X_ENFC[n] at point n:
X′_ENFC[n] = f_d (X_ENFC[n] − X_ENFC[n−1])  (1)
where multiplication by f_d (the resampling frequency) turns the first difference into an approximate derivative, and X_ENFC[n] denotes the value of the nth point of the ENF component;
Step 2.2: frame and window X_ENFC[n] and X′_ENFC[n]; the frame length is 10 standard ENF periods and the frame shift is 1 standard ENF period; each frame is windowed with a Hanning window w(n):
X_N[n] = X_ENFC[n] w(n)  (2)
X′_N[n] = X′_ENFC[n] w(n)  (3)
where the Hanning window is w(n) = 0.5 − 0.5 cos(2πn/L) and L is the window length;
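Steps 2.1 and 2.2 can be sketched as follows; the resampling frequency f_d = 1000 Hz and nominal grid frequency 50 Hz are illustrative assumptions, not values fixed by the claim:

```python
import numpy as np

def frame_and_window(x_enfc, fd=1000.0, f_enf=50.0):
    """Approximate derivative (eq. 1), then framing and Hanning windowing.

    fd (resampling frequency) and f_enf (nominal grid frequency) are
    illustrative assumptions."""
    # Step 2.1: approximate first derivative X'[n] = fd * (X[n] - X[n-1])
    x_d = fd * (x_enfc[1:] - x_enfc[:-1])
    x = x_enfc[1:]                        # align the two sequences

    frame_len = int(10 * fd / f_enf)      # frame length: 10 standard ENF periods
    hop = int(fd / f_enf)                 # frame shift: 1 standard ENF period
    w = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(frame_len) / frame_len)  # Hanning

    n_frames = (len(x) - frame_len) // hop + 1
    frames = np.stack([x[i * hop:i * hop + frame_len] * w
                       for i in range(n_frames)])
    frames_d = np.stack([x_d[i * hop:i * hop + frame_len] * w
                         for i in range(n_frames)])
    return frames, frames_d
```

With fd = 1000 Hz and f_enf = 50 Hz this gives 200-sample frames with a 20-sample shift.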
Step 2.3: perform an N-point discrete Fourier transform (DFT) on each frame of the signals X_N[n] and X′_N[n] to obtain X(k) and X′(k);
Step 2.4: let k_peak be the index of the peak of |X(k)|; k_peak is used to solve the frequency estimate f_DFT = k_peak·f_d/N;
Step 2.5: from the estimated frequency f_DFT of the ENF signal, the ENF phase feature φ⁰ = arg[X(k_peak)] can be found;
Step 2.6: re-estimate the ENF phase φ¹ from the DFT¹ transform; let k_peak be the index of the peak of |X′(k)|, and multiply |X′(k)| by a scale factor F(k) = (πk/N)/sin(πk/N),
obtaining DFT⁰[k] = X(k) and DFT¹[k] = F(k)|X′(k)|; the estimated frequency value is therefore f_DFT¹ = DFT¹[k_peak]/(2π|DFT⁰[k_peak]|);
Step 2.7: k_peak should be the integer closest to N·f_DFT¹/f_d (f_d is the resampling frequency), so that f_DFT¹ is a reasonable frequency value; the ENF phase can then be represented as φ¹ = θ ± π/2,
where θ is the value of arg[X′(k)] at the point k = N·f_DFT¹/f_d, obtained by linear interpolation from X′(k); floor[a] denotes the largest integer smaller than a, and ceil[b] the smallest integer larger than b;
since arg[X′(k)] varies approximately linearly near the peak, linear interpolation between (k_low, θ_low) = (floor[N·f_DFT¹/f_d], arg[X′(k_low)]) and (k_high, θ_high) = (ceil[N·f_DFT¹/f_d], arg[X′(k_high)]) can approximate the point (N·f_DFT¹/f_d, θ), and the value obtained is consistent with the value of θ in the above formula;
Step 2.8: the φ¹ found in this way has two possible values; with φ⁰ as reference, the value closest to φ⁰ is selected as the final value of φ¹.
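A compact NumPy sketch of steps 2.3 to 2.8 follows. The scale factor F(k) = (πk/N)/sin(πk/N), the frequency formula and the θ ± π/2 candidate pair are assumptions based on the standard high-precision DFT⁰/DFT¹ phase analysis method; phase unwrapping is simplified for brevity:

```python
import numpy as np

def dft1_phase(frame, frame_d, fd=1000.0, N=2000):
    """Estimate ENF frequency and phase phi1 from a frame and its
    approximate-derivative frame (steps 2.3-2.8); N and fd are
    illustrative assumptions."""
    X = np.fft.rfft(frame, N)             # step 2.3: N-point DFTs
    Xd = np.fft.rfft(frame_d, N)
    k_peak = int(np.argmax(np.abs(X)))

    phi0 = np.angle(X[k_peak])            # step 2.5: DFT0 phase

    # Step 2.6: scale factor corrects the first-difference magnitude (assumed form)
    F = (np.pi * k_peak / N) / np.sin(np.pi * k_peak / N)
    f_hat = F * np.abs(Xd[k_peak]) / np.abs(X[k_peak]) / (2 * np.pi)

    # Step 2.7: interpolate arg X'(k) at the (generally non-integer) k*
    k_star = N * f_hat / fd
    k_low, k_high = int(np.floor(k_star)), int(np.ceil(k_star))
    if k_low == k_high:
        theta = np.angle(Xd[k_low])
    else:
        t_low, t_high = np.unwrap([np.angle(Xd[k_low]), np.angle(Xd[k_high])])
        theta = t_low + (k_star - k_low) * (t_high - t_low)

    # Step 2.8: two candidates theta +/- pi/2; pick the one closest to phi0
    cands = np.angle(np.exp(1j * np.array([theta - np.pi / 2,
                                           theta + np.pi / 2])))
    phi1 = cands[np.argmin(np.abs(np.angle(np.exp(1j * (cands - phi0)))))]
    return f_hat, phi1
```

A small bias remains in φ¹ from the half-sample lag of the first-difference approximation; the patent's exact compensation is not reproduced here.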
The specific method for calculating the ENF spatial feature matrix comprises:
step 3.1: acquire the audio data with the longest duration among the audio data to be detected;
step 3.2: for the longest-duration audio, obtain the phase φ by DFT conversion;
step 3.3: calculate the length of this longest phase sequence;
step 3.4: calculate the frame length m from that length, where m is the frame length of the frequency feature matrix;
step 3.5: calculate the phases φ of all the audio data;
step 3.6: calculate the frame shift and perform framing;
step 3.7: reshape the framed phase and frequency to obtain the feature matrix P_n×n;
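Steps 3.5 to 3.7 can be sketched as follows; the matrix side n = 45 and the even-overlap rule are illustrative assumptions, since the patent derives m and the frame shift from the longest-duration audio:

```python
import numpy as np

def enf_spatial_matrix(phi, n=45):
    """Frame the ENF phase sequence so that n frames of length n are
    obtained, then stack (Reshape) them into the n x n spatial feature
    matrix P. Assumes len(phi) >= n; the overlap rule is an assumption."""
    hop = max(1, (len(phi) - n) // (n - 1))     # frame shift so n frames fit
    rows = [phi[i * hop:i * hop + n] for i in range(n)]
    return np.stack(rows)                       # feature matrix P (n x n)
```

Overlapping rows turn a one-dimensional phase sequence into a two-dimensional matrix that a CNN can treat spatially.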
The specific method for calculating the ENF temporal characterization comprises:
step 4.1: acquire the audio data with the longest duration among the audio data to be detected;
step 4.2: for the longest-duration audio, obtain the phase φ by DFT conversion;
step 4.3: set the frame length m and calculate the number of frames n accordingly;
step 4.4: for all audio data, calculate the frame-shift overlap, overlap = m − floor(length(φ)/n);
step 4.5: because length(φ) generally cannot be divided into frames exactly, the framing is split into two parts whose frame shifts differ by 1, with the remainder k = length(φ) − (m − overlap)·n;
step 4.6: the framed data form the ENF phase temporal characterization;
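Steps 4.3 to 4.6 can be sketched as follows. The frame shift m − overlap = floor(len(φ)/n) follows the claim; because the exact two-part split rule is garbled in the source, distributing the remainder k over the first k frames (and zero-padding short tail frames) is an assumption, as are m = 90 and n = 40:

```python
import numpy as np

def enf_temporal_frames(phi, m=90, n=40):
    """Frame the phase sequence into n frames of length m with a two-part
    frame shift absorbing the remainder k (split rule assumed)."""
    hop = len(phi) // n                  # frame shift = m - overlap
    k = len(phi) - hop * n               # remainder that cannot be framed evenly
    frames, start = [], 0
    for i in range(n):
        seg = phi[start:start + m]
        if len(seg) < m:                 # zero-pad tail frames if needed
            seg = np.pad(seg, (0, m - len(seg)))
        frames.append(seg)
        start += hop + (1 if i < k else 0)   # first k frames shift one more
    return np.stack(frames)              # ENF phase temporal characterization
```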
The network model part comprises:
step 5.1: acquire spatial information through a convolutional neural network; the feature matrix P_n×n is processed by two convolution blocks to obtain the ENF spatial information, where each convolution block consists of two identical convolution layers and one pooling layer, the numbers of convolution kernels of the two blocks are 32 and 64, the convolution kernel size is 3×3 with stride 1, and the max-pooling size is 3; the last pooling layer outputs the ENF spatial information;
step 5.2: acquire temporal information through a Bi-LSTM network; two bidirectional long short-term memory (Bi-LSTM) modules are trained on the ENF phase temporal characterization and output the state of each time step; each Bi-LSTM module comprises a bidirectional LSTM layer, a LayerNormalization layer and a LeakyReLU activation function;
step 5.3: concatenate the spatial and temporal features to obtain a feature vector of length L;
step 5.4: pass this vector through three fully connected layers, all with ReLU activations but with L, L/8 and L neurons respectively; the purpose is to compress the features, following the Squeeze-and-Excitation network, whose bottleneck increases the nonlinearity of the network so that more accurate weights can be obtained; after this nonlinear operation, the weights of the spatial and temporal features are obtained through one fully connected layer with L neurons and a Sigmoid activation function;
step 5.5: multiply the obtained weights with the spatial and temporal features from before the attention mechanism to weight them; through automatic learning, different weights can be assigned to the spatial and temporal features, and features with a large influence on the classification result receive larger weights, improving the detection precision;
step 5.6: fit and classify the fused spatial and temporal features; two fully connected layers with 1024 and 256 neurons and ReLU activations fit the features fully, a Dropout layer (dropout rate = 0.2) between the two layers prevents overfitting, and finally a fully connected layer with 2 neurons and a Softmax activation function serves as the output layer;
step 5.7: the output-layer probabilities finally indicate whether the speech under test has been tampered with; the proportion of all test utterances whose tampering status is correctly recognized is the recognition rate of the system.
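The Squeeze-and-Excitation-style attention fusion described above (concatenation, L → L/8 → L bottleneck with ReLU, Sigmoid weight layer, element-wise reweighting) can be sketched in NumPy; the random matrices stand in for the trained parameters and are purely illustrative:

```python
import numpy as np

def se_attention_fuse(spatial_feat, temporal_feat, rng=None):
    """Concatenate spatial and temporal features, compute per-feature
    attention weights through an SE-style bottleneck, and reweight.
    Random weights replace trained ones (illustration only)."""
    rng = rng or np.random.default_rng(0)
    z = np.concatenate([spatial_feat, temporal_feat])   # length-L fused vector
    L = z.shape[0]
    relu = lambda a: np.maximum(a, 0.0)
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

    W1 = rng.normal(size=(L, L)) * 0.1        # FC, L neurons, ReLU
    W2 = rng.normal(size=(L // 8, L)) * 0.1   # FC, L/8 neurons, ReLU (squeeze)
    W3 = rng.normal(size=(L, L // 8)) * 0.1   # FC, L neurons, ReLU (excite)
    W4 = rng.normal(size=(L, L)) * 0.1        # FC, L neurons, Sigmoid

    h = relu(W3 @ relu(W2 @ relu(W1 @ z)))
    w = sigmoid(W4 @ h)                        # per-feature attention weights
    return w * z                               # weighted fused features
```

The Sigmoid keeps every weight in (0, 1), so each feature is attenuated in proportion to its learned importance before the final classification layers.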
2. The digital audio tampering passive detection method based on fusion of power grid frequency space and time sequence features according to claim 1, wherein the original voice signal is processed to obtain the electric network frequency (ENF) component, specifically comprising:
down-sampling: the signal resampling frequency is set to 1000 Hz or 1200 Hz;
narrow-band filtering: a 10000-order linear zero-phase FIR filter is used, with its center frequency at the ENF standard frequency, a bandwidth of 0.6 Hz, a passband ripple of 0.5 dB and a stopband attenuation of 100 dB.
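The ENF extraction of claim 2 can be sketched as follows; for brevity, linear-interpolation resampling and an ideal FFT-domain band-pass replace the 10000-order zero-phase FIR filter of the claim, and f_enf = 50 Hz is an illustrative assumption:

```python
import numpy as np

def extract_enf(audio, fs, fd=1000.0, f_enf=50.0, bw=0.6):
    """Resample to fd and band-pass a 0.6 Hz band around the nominal grid
    frequency (ideal FFT-domain filter standing in for the FIR filter)."""
    # naive resampling by linear interpolation (illustrative only)
    t_new = np.arange(0, len(audio) / fs, 1.0 / fd)
    x = np.interp(t_new, np.arange(len(audio)) / fs, audio)

    # ideal zero-phase band-pass around f_enf via the FFT
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fd)
    X[np.abs(freqs - f_enf) > bw / 2] = 0.0
    return np.fft.irfft(X, n=len(x))
```

An FFT-domain mask is zero-phase by construction, mirroring the zero-phase requirement of the claimed FIR design.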
3. A digital audio tampering passive detection device based on fusion of power grid frequency space and time sequence features, adopting the method of any one of claims 1 to 2, comprising:
a first module, configured to process the audio data to be detected to obtain the electric network frequency (ENF) component, and to process the ENF component based on the DFT¹ transform to obtain the ENF phases φ⁰ and φ¹;
a second module, configured to calculate the frame shifts corresponding to the ENF phases φ⁰ and φ¹ respectively and to frame the phases accordingly; the framed data obtained from the phase φ⁰ are reshaped into the ENF spatial feature matrix, and the framed data obtained from the phase φ¹ are split into two parts to form the ENF temporal characterization;
a third module, configured to acquire spatial information from the spatial feature matrix with a neural network, acquire ENF temporal information from the ENF phase temporal characterization, and fuse, fit and classify the spatial and temporal information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210450835.1A CN114722964B (en) | 2022-04-26 | 2022-04-26 | Digital audio tampering passive detection method and device based on fusion of power grid frequency space and time sequence characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114722964A CN114722964A (en) | 2022-07-08 |
CN114722964B true CN114722964B (en) | 2024-08-02 |
Family
ID=82245886
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210450835.1A Active CN114722964B (en) | 2022-04-26 | 2022-04-26 | Digital audio tampering passive detection method and device based on fusion of power grid frequency space and time sequence characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114722964B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118368483B (en) * | 2024-06-19 | 2024-09-06 | 华侨大学 | Method, device, equipment and medium for detecting video inter-frame tampering in power grid environment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109284717A (en) * | 2018-09-25 | 2019-01-29 | 华中师范大学 | It is a kind of to paste the detection method and system for distorting operation towards digital audio duplication |
CN112151067A (en) * | 2020-09-27 | 2020-12-29 | 湖北工业大学 | Passive detection method for digital audio tampering based on convolutional neural network |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11069370B2 (en) * | 2016-01-11 | 2021-07-20 | University Of Tennessee Research Foundation | Tampering detection and location identification of digital audio recordings |
CN112489677B (en) * | 2020-11-20 | 2023-09-22 | 平安科技(深圳)有限公司 | Voice endpoint detection method, device, equipment and medium based on neural network |
Also Published As
Publication number | Publication date |
---|---|
CN114722964A (en) | 2022-07-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110728360B (en) | Micro-energy device energy identification method based on BP neural network | |
CN112885372B (en) | Intelligent diagnosis method, system, terminal and medium for power equipment fault sound | |
CN112151067B (en) | Digital audio tampering passive detection method based on convolutional neural network | |
CN111986699B (en) | Sound event detection method based on full convolution network | |
CN112446242A (en) | Acoustic scene classification method and device and corresponding equipment | |
CN113707175B (en) | Acoustic event detection system based on feature decomposition classifier and adaptive post-processing | |
WO2023226355A1 (en) | Dual-ion battery fault detection method and system based on multi-source perception | |
CN114722964B (en) | Digital audio tampering passive detection method and device based on fusion of power grid frequency space and time sequence characteristics | |
CN109658943A (en) | A kind of detection method of audio-frequency noise, device, storage medium and mobile terminal | |
CN112529177A (en) | Vehicle collision detection method and device | |
CN115393968A (en) | Audio-visual event positioning method fusing self-supervision multi-mode features | |
Zhang et al. | Temporal Transformer Networks for Acoustic Scene Classification. | |
CN116741159A (en) | Audio classification and model training method and device, electronic equipment and storage medium | |
CN114067829A (en) | Reactor fault diagnosis method and device, computer equipment and storage medium | |
CN115270906A (en) | Passive digital audio tampering detection method and device based on power grid frequency depth layer feature fusion | |
CN116584956A (en) | Single-channel electroencephalogram sleepiness detection method based on lightweight neural network | |
Čavor et al. | Vehicle speed estimation from audio signals using 1d convolutional neural networks | |
CN114822590B (en) | Digital audio tampering passive detection method and device based on power grid frequency phase timing sequence characterization | |
CN113177536B (en) | Vehicle collision detection method and device based on deep residual shrinkage network | |
CN114121025A (en) | Voiceprint fault intelligent detection method and device for substation equipment | |
PVSMS | A deep learning based system to predict the noise (disturbance) in audio files | |
Luo et al. | Sound-Convolutional Recurrent Neural Networks for Vehicle Classification Based on Vehicle Acoustic Signals | |
CN111048203B (en) | Brain blood flow regulator evaluation device | |
CN118427670B (en) | Online data monitoring method and system for battery exchange cabinet | |
CN118503893B (en) | Time sequence data anomaly detection method and device based on space-time characteristic representation difference |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||