CN114722964B - Digital audio tampering passive detection method and device based on fusion of power grid frequency space and time sequence characteristics - Google Patents
Digital audio tampering passive detection method and device based on fusion of power grid frequency space and time sequence characteristics
- Publication number
- CN114722964B CN114722964B CN202210450835.1A CN202210450835A CN114722964B CN 114722964 B CN114722964 B CN 114722964B CN 202210450835 A CN202210450835 A CN 202210450835A CN 114722964 B CN114722964 B CN 114722964B
- Authority
- CN
- China
- Prior art keywords
- enf
- time sequence
- phase
- space
- frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y04—INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
- Y04S—SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
- Y04S40/00—Systems for electrical power generation, transmission, distribution or end-user application management characterised by the use of communication or information technologies, or communication or information technology specific aspects supporting them
- Y04S40/20—Information technology specific aspects, e.g. CAD, simulation, modelling, system security
Abstract
The invention relates to a digital audio tampering passive detection method and device based on the fusion of power grid frequency space and time sequence characteristics. The audio data to be detected are first processed to obtain the electric network frequency (ENF) component and, from it, the ENF phases φ_DFT⁰ and φ̂_DFT¹. The frame number and frame length of the space characterization and the time sequence characterization are determined according to the longest-duration audio to be detected, the frame shifts corresponding to the phases φ_DFT⁰ and φ̂_DFT¹ are calculated, and the phase sequences are framed accordingly. The framed φ_DFT⁰ data are reshaped to obtain an ENF space feature matrix, and the framed φ̂_DFT¹ data are split into two parts to form the ENF time sequence characterization. A neural network then acquires space information from the space feature matrix and ENF time sequence information from the ENF phase time sequence characterization, and the space and time sequence information are fused, fitted and classified. By fusing space and time sequence features, the invention describes the ENF changes in the audio more comprehensively and thereby improves detection accuracy.
Description
Technical Field
The invention belongs to the technical field of digital audio tamper detection, and particularly relates to a digital audio tamper passive detection method and device based on fusion of power grid frequency space and time sequence characteristics.
Background
With the rapid progress of digital audio technology, digital audio signals can be collected conveniently, but they can also be edited and modified just as easily afterwards with a variety of audio processing software. If intentionally or unintentionally tampered digital audio is used on important occasions such as judicial evidence collection, serious social problems may result, so research on digital audio tampering detection is of great significance.
Passive detection of digital audio tampering is a technology that analyses and judges the authenticity and integrity of digital audio using only the characteristics of the audio itself, without any added information, which makes it practical for complex forensic environments. When the recording device is powered by the power grid, an electric network frequency (ENF) signal remains in the recorded audio file. When the digital audio is tampered with, the ENF signal changes along with the tampering operation. The uniqueness and stability of the ENF signal therefore suggest two research directions for passive tampering detection: first, comparing the ENF signal extracted from the audio with an ENF database maintained by the power supply authority, which is difficult and costly to implement; second, extracting certain features of the ENF signal and analysing their consistency and regularity. Current research on audio tampering forensics with ENF signals mainly applies traditional machine learning to classify features of the ENF signal, such as phase change, phase discontinuity and instantaneous frequency mutation, in order to detect tampering.
Existing digital audio detection methods either set thresholds on the corresponding features or classify them with machine learning methods. These methods often rely too heavily on empirical settings, or are so tailored to one specific tampering method that their discrimination ability for other methods is insufficient.
In recent years, with improvements in machine learning algorithms and in the storage and computing power of computers, deep neural networks (DNN) have been applied to the field of audio tamper detection. Deep nonlinear transformations in a DNN fit the audio tampering characteristics better, enabling automatic learning and detection with a high recognition rate. However, existing methods use only a single kind of feature information and cannot fully exploit the power grid frequency information. The invention therefore provides an audio tampering detection method based on the fusion of power grid frequency space and time sequence characteristics: an ENF phase feature matrix serves as the spatial feature, from which a convolutional neural network acquires ENF space information; the ENF phase time sequence characterization serves as the temporal feature, from which a Bi-LSTM network acquires ENF time sequence information; the space and time sequence information are then fused through an attention mechanism, and finally a DNN classifier separates real audio from tampered audio.
Disclosure of Invention
The technical problems of the invention are mainly solved by the following technical solutions:
a digital audio tampering passive detection method based on fusion of power grid frequency space and time sequence features, characterized by comprising the following steps:
processing the audio data to be detected to obtain the electric network frequency (ENF) component, and processing the ENF component based on the DFT⁰ and DFT¹ transforms to obtain the ENF phases φ_DFT⁰ and φ̂_DFT¹;
determining the frame number and frame length of the space characterization and the time sequence characterization according to the longest-duration audio to be detected, calculating the frame shifts corresponding to the ENF phases φ_DFT⁰ and φ̂_DFT¹ respectively, and framing accordingly; reshaping the framing data obtained from the ENF phase φ_DFT⁰ to obtain an ENF space feature matrix, and splitting the framing data obtained from the ENF phase φ̂_DFT¹ into two parts to form the ENF time sequence characterization;
acquiring space information from the space feature matrix by using a neural network, acquiring ENF time sequence information from the ENF phase time sequence characterization, and fusing, fitting and classifying the space and time sequence information.
In the above method for passive detection of digital audio tampering based on fusion of power grid frequency space and time sequence features, processing the original voice signal to obtain the electric network frequency (ENF) component specifically comprises:
downsampling, with the signal resampling frequency set to 1000 Hz or 1200 Hz;
narrow-band filtering with a 10000-order linear zero-phase FIR filter, with the center frequency at the ENF standard, a bandwidth of 0.6 Hz, a passband ripple of 0.5 dB and a stopband attenuation of 100 dB.
In the above method for passive detection of digital audio tampering based on fusion of power grid frequency space and time sequence features, acquiring the ENF phase comprises:
Step 2.1, calculating the approximate first derivative of the ENF signal X_ENFC[n] at point n:
X′_ENFC[n] = f_d(X_ENFC[n] − X_ENFC[n−1])   (1)
where f_d is the resampling frequency, so that the scaled first difference approximates the derivative, and X_ENFC[n] is the value of the n-th point of the ENF component;
Step 2.2, framing and windowing X_ENFC[n] and X′_ENFC[n], with a frame length of 10 standard ENF frequency periods and a frame shift of 1 standard ENF frequency period, windowing X_ENFC[n] and X′_ENFC[n] with a Hanning window w(n):
X_N[n] = X_ENFC[n]w(n)   (2)
X′_N[n] = X′_ENFC[n]w(n)   (3)
where the Hanning window is w(n) = 0.5(1 − cos(2πn/(L−1))), 0 ≤ n ≤ L−1, and L is the window length;
Step 2.3, performing an N_DFT-point discrete Fourier transform (DFT) on each frame of the signals X_N[n] and X′_N[n] to obtain X(k) and X′(k);
Step 2.4, letting k_peak be the index of the peak of |X(k)|, from which the estimated frequency f_DFT = k_peak·f_d/N_DFT is obtained;
Step 2.5, from the estimated frequency f_DFT of the ENF signal, the ENF phase characteristic φ_DFT⁰ = arg[X(k_peak)] can be found;
Step 2.6, re-evaluating the DFT¹-transformed ENF phase φ̂_DFT¹: let k_peak be the index of the peak of |X′(k)|, and multiply |X′(k)| by a scale factor F(k), obtaining DFT⁰[k] = X(k) and DFT¹[k] = F(k)|X′(k)|; the estimated frequency value is then f_DFT¹ = DFT¹[k_peak]/(2π|DFT⁰[k_peak]|);
Step 2.7, k_peak should be the integer nearest to f_DFT¹·N_DFT/f_d (f_d is the resampling frequency), so that f_DFT¹ is a reasonable frequency value; φ̂_DFT¹ can then be represented in terms of the angle θ at the point f_DFT¹·N_DFT/f_d, which is obtained by linear interpolation from X′(k) between k_low = floor[f_DFT¹·N_DFT/f_d] and k_high = ceil[f_DFT¹·N_DFT/f_d], where floor[a] denotes the largest integer not exceeding a and ceil[b] the smallest integer not less than b;
since θ_low = arg[X′(k_low)] and θ_high = arg[X′(k_high)], linear interpolation between (k_low, θ_low) and (k_high, θ_high) approximates the value of θ at that point, consistent with the value of θ above;
Step 2.8, the φ̂_DFT¹ found this way has two possible values; using φ_DFT⁰ as reference, the value closest to φ_DFT⁰ is selected as the final value of φ̂_DFT¹.
In the above method for passive detection of digital audio tampering based on fusion of power grid frequency space and time sequence features, calculating the ENF space feature matrix specifically comprises:
Step 3.1, acquiring the audio data with the longest duration among the audio data to be detected;
Step 3.2, for the longest-duration audio, obtaining the phase sequence φ by DFT transformation;
Step 3.3, calculating the length of the longest phase sequence, length(φ);
Step 3.4, calculating the frame length m from the longest phase sequence, where m is the frame length of the frequency feature matrix, chosen so that the framed data can be reshaped into a square matrix;
Step 3.5, calculating the phase sequences of all the audio data;
Step 3.6, calculating the frame shift (from m, the frame number n and length(φ)) and framing;
Step 3.7, reshaping the framed phase and frequency data to obtain the feature matrix P_{n×n}.
In the above method for passive detection of digital audio tampering based on fusion of power grid frequency space and time sequence features, the specific method for calculating the ENF time sequence characterization comprises:
Step 4.1, acquiring the longest-duration audio data among the audio data to be detected;
Step 4.2, for the longest-duration audio, obtaining the phase sequence φ̂ by DFT¹ transformation;
Step 4.3, setting the frame length m and calculating the number of frames n from the length of the longest phase sequence;
Step 4.4, for all audio data, calculating the frame shift overlap = m − floor(length(φ̂)/n);
Step 4.5, since the sequence may not divide evenly, splitting the frames into two parts: the first k frames keep the full frame length, while the remaining frames are one sample shorter, where k = length(φ̂) − (m − overlap)×n;
Step 4.6, arranging the framed phase data frame by frame to form the ENF phase time sequence characterization.
In the above method for passive detection of digital audio tampering based on fusion of power grid frequency space and time sequence features, the network model part comprises:
Step 5.1, acquiring spatial information through a convolutional neural network: the feature matrix P_{n×n} is processed with two convolution blocks to obtain the ENF space information, each convolution block consisting of two identical convolution layers and one pooling layer (the numbers of convolution kernels of the two convolution blocks are 32 and 64, the kernel size is 3×3, the stride is 1, and the MaxPooling pool size is 3); the last pooling layer outputs the ENF space information;
Step 5.2, acquiring time sequence information through a Bi-LSTM network: the ENF phase time sequence characterization is trained with two bidirectional long short-term memory (Bi-LSTM) modules, and the state of each time step is output; each Bi-LSTM module comprises a bidirectional LSTM layer, a LayerNormalization layer and a LeakyReLU activation function;
Step 5.3, concatenating the space and time sequence features to obtain a feature vector of length L;
Step 5.4, passing the vector through three fully connected layers, all with ReLU activation but with L, L/8 and L neurons respectively; the purpose is to compress the features, following the SE (Squeeze-and-Excitation) network, since the compression increases the nonlinearity of the network and yields more accurate weights; after this nonlinear operation, the weights of the space and time sequence features are obtained through one fully connected layer with L neurons and Sigmoid activation;
Step 5.5, multiplying the obtained weights by the space and time sequence features from before the attention mechanism to weight them; through automatic learning, different weights can be given to the space and time sequence features, and features with a large influence on the classification result receive larger weights, improving detection accuracy;
Step 5.6, fitting and classifying the fused space and time sequence features: the features are fully fitted with two fully connected layers (1024 and 256 neurons respectively, ReLU activation); a Dropout layer (dropout rate 0.2) is added between the two fully connected layers to prevent overfitting; finally, a fully connected layer (2 neurons, Softmax activation) serves as the output layer;
Step 5.7, the probability obtained from the output layer indicates whether the speech under test has been tampered with; the proportion of all test speech whose tampering status is correctly recognized is the recognition rate of the system.
A digital audio tampering passive detection device based on fusion of power grid frequency space and time sequence characteristics, characterized by comprising:
a first module, configured to process the audio data to be detected to obtain the electric network frequency (ENF) component and to process the ENF component based on the DFT⁰ and DFT¹ transforms to obtain the ENF phases φ_DFT⁰ and φ̂_DFT¹;
a second module, configured to calculate the frame shifts corresponding to the ENF phases φ_DFT⁰ and φ̂_DFT¹ respectively and to frame them accordingly, to reshape the framing data obtained from the ENF phase φ_DFT⁰ into an ENF space feature matrix, and to split the framing data obtained from the ENF phase φ̂_DFT¹ into two parts to form the ENF time sequence characterization;
a third module, configured to acquire space information from the space feature matrix using a neural network, to acquire ENF time sequence information from the ENF phase time sequence characterization, and to fuse, fit and classify the space and time sequence information.
Therefore, the invention has the following advantages. The invention provides a deep learning method that classifies by fusing ENF space and time sequence features. Where traditional methods express features insufficiently and do not make full use of ENF time sequence information, the fusion of space and time sequence features overcomes the single-feature limitation of traditional audio tampering detection, describes the ENF changes in the audio more comprehensively, and improves detection accuracy. An attention mechanism fuses the features, and automatic learning extracts the useful information from the audio ENF for the tampering detection classification task, reducing the interference of invalid information in the final result. Compared with traditional digital audio tamper detection methods, the invention can effectively improve the recognition performance of the system, strengthen the generalization capability of the model, optimize the system structure, and improve the competitiveness of corresponding device-source identification products.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
Fig. 2 is a structural diagram of a neural network.
Detailed Description
The technical scheme of the invention is further specifically described below through examples and with reference to the accompanying drawings.
Examples:
The invention relates to a digital audio tampering passive detection method based on fusion of power grid frequency space and time sequence characteristics. The algorithm flow, shown in FIG. 1, can be divided into five parts: 1) acquiring the ENF component; 2) extracting the ENF phase features; 3) acquiring the ENF space feature matrix; 4) acquiring the ENF time sequence characterization; 5) training the neural network.
Step one: the ENF component is obtained by the following steps:
A. downsampling the audio with a resampling frequency of 1000 Hz or 1200 Hz;
B. narrow-band filtering with a 10000-order linear zero-phase FIR filter, with the center frequency at the ENF standard (50 Hz or 60 Hz), a bandwidth of 0.6 Hz, a passband ripple of 0.5 dB and a stopband attenuation of 100 dB;
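The two steps above can be sketched in plain numpy. The naive linear-interpolation resampler, the 801-tap filter length and the windowed-sinc kernel below are illustrative assumptions for a compact example, not the patent's 10000-order design:

```python
import numpy as np

def extract_enf_component(audio, fs, fd=1000.0, f_enf=50.0,
                          bandwidth=0.6, numtaps=801):
    """Sketch of ENF-component extraction: resample to fd, then
    narrow-band filter around the nominal grid frequency f_enf."""
    # Naive resampling by linear interpolation (a production system would
    # use a proper polyphase resampler with anti-aliasing filtering).
    n_out = int(len(audio) * fd / fs)
    t_out = np.arange(n_out) / fd
    t_in = np.arange(len(audio)) / fs
    x = np.interp(t_out, t_in, audio)

    # Windowed-sinc band-pass FIR centred on f_enf with the given bandwidth.
    # A symmetric FIR has exactly linear phase; convolving with mode='same'
    # keeps the output aligned with the input.
    m = np.arange(numtaps) - (numtaps - 1) / 2
    f_lo = (f_enf - bandwidth / 2) / fd
    f_hi = (f_enf + bandwidth / 2) / fd
    h = 2 * f_hi * np.sinc(2 * f_hi * m) - 2 * f_lo * np.sinc(2 * f_lo * m)
    h *= np.hanning(numtaps)
    return np.convolve(x, h, mode='same')
```

A short filter cannot reach the 0.6 Hz transition of the patent's filter, but it already attenuates out-of-band components far from 50 Hz very strongly.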
step two: the ENF phase feature extraction comprises the following steps:
A. Calculating the first derivative of the signal, framing and windowing, applying the discrete Fourier transform, estimating the phase via linear interpolation, and computing the phase fluctuation characteristics:
(A-1) calculating the approximate first derivative of the ENF signal X_ENFC[n] at point n:
X′_ENFC[n] = f_d(X_ENFC[n] − X_ENFC[n−1])   (1)
where f_d is the resampling frequency, so that the scaled first difference approximates the derivative, and X_ENFC[n] is the value of the n-th point of the ENF component.
(A-2) framing and windowing X_ENFC[n] and X′_ENFC[n], with a frame length of 10 standard ENF frequency periods and a frame shift of 1 standard ENF frequency period; windowing X_ENFC[n] and X′_ENFC[n] with a Hanning window w(n):
X_N[n] = X_ENFC[n]w(n)   (2)
X′_N[n] = X′_ENFC[n]w(n)   (3)
where the Hanning window is w(n) = 0.5(1 − cos(2πn/(L−1))), 0 ≤ n ≤ L−1, and L is the window length.
(A-3) performing an N_DFT-point discrete Fourier transform (DFT) on each frame of the signals X_N[n] and X′_N[n] to obtain X(k) and X′(k) respectively.
(A-4) letting k_peak be the index of the peak of |X(k)|, from which the estimated frequency f_DFT = k_peak·f_d/N_DFT is obtained.
(A-5) from the estimated frequency f_DFT of the ENF signal, the ENF phase characteristic φ_DFT⁰ = arg[X(k_peak)] can be found.
(A-6) re-evaluating the DFT¹-transformed ENF phase φ̂_DFT¹: let k_peak be the index of the peak of |X′(k)|, and multiply |X′(k)| by a scale factor F(k), obtaining DFT⁰[k] = X(k) and DFT¹[k] = F(k)|X′(k)|; the estimated frequency value is then f_DFT¹ = DFT¹[k_peak]/(2π|DFT⁰[k_peak]|).
(A-7) k_peak should be the integer closest to f_DFT¹·N_DFT/f_d (f_d is the resampling frequency), so that f_DFT¹ is a reasonable frequency value; φ̂_DFT¹ can then be represented in terms of the angle θ at the point f_DFT¹·N_DFT/f_d, obtained by linear interpolation from X′(k) between k_low = floor[f_DFT¹·N_DFT/f_d] and k_high = ceil[f_DFT¹·N_DFT/f_d], where floor[a] denotes the largest integer not exceeding a and ceil[b] the smallest integer not less than b.
Since θ_low = arg[X′(k_low)] and θ_high = arg[X′(k_high)], linear interpolation between (k_low, θ_low) and (k_high, θ_high) approximates the value of θ at that point, consistent with the value of θ above.
(A-8) the φ̂_DFT¹ obtained this way has two possible values; using φ_DFT⁰ as reference, the value closest to φ_DFT⁰ is selected as the final value of φ̂_DFT¹.
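A simplified single-frame sketch of steps (A-1) to (A-8) follows. It omits the linear interpolation of step (A-7) and evaluates the candidate selection of step (A-8) directly at the spectral peak; tolerances and frame sizes are illustrative assumptions:

```python
import numpy as np

def estimate_enf_phase(frame, fd):
    """Sketch of the DFT0/DFT1 phase estimation for one frame."""
    x = np.asarray(frame, float)
    # (A-1): approximate first derivative, scaled by the sampling rate fd.
    # Note the first difference introduces a half-sample phase lag.
    dx = np.empty_like(x)
    dx[1:] = fd * (x[1:] - x[:-1])
    dx[0] = dx[1]
    # (A-2)/(A-3): Hanning window and DFT of signal and derivative.
    w = np.hanning(len(x))
    X = np.fft.rfft(x * w)
    Xd = np.fft.rfft(dx * w)
    # (A-4)/(A-5): spectral peak and DFT0 phase estimate.
    k = np.argmax(np.abs(X))
    phi0 = np.angle(X[k])
    # (A-8): differentiation shifts a sinusoid's phase by +pi/2, giving
    # two candidate DFT1 values; pick the one closest to phi0 (mod 2*pi).
    theta = np.angle(Xd[k])
    cands = np.array([theta - np.pi / 2, theta + np.pi / 2])
    diff = np.angle(np.exp(1j * (cands - phi0)))
    return phi0, cands[np.argmin(np.abs(diff))]
```

With a 50 Hz tone sampled at 1000 Hz and a 200-sample frame (10 ENF cycles), the DFT⁰ estimate is nearly exact, while the DFT¹ candidate carries the small half-sample bias of the first difference.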
Step three: the ENF space feature matrix is acquired by the following steps:
A. Calculating the ENF space feature matrix:
(A-1) acquiring the audio data with the longest duration among the audio data to be detected.
(A-2) for the longest-duration audio, obtaining the phase sequence φ by DFT transformation.
(A-3) calculating the length of the longest phase sequence, length(φ).
(A-4) calculating the frame length m from the longest phase sequence, where m is the frame length of the frequency feature matrix, chosen so that the framed data can be reshaped into a square matrix.
(A-5) calculating the phase sequences of all the audio data.
(A-6) calculating the frame shift (from m, the frame number n and length(φ)) and framing.
(A-7) reshaping the framed phase and frequency data to obtain the feature matrix P_{n×n}.
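The framing-and-reshape above can be sketched as follows, under two assumptions not stated explicitly in the steps: the frame length equals the number of frames (so the stacked frames form a square matrix), and sequences that do not divide evenly are edge-padded:

```python
import numpy as np

def phase_to_square_matrix(phase, n):
    """Sketch: frame a phase sequence into n overlapping frames of
    length n and stack them into an n-by-n ENF space feature matrix."""
    phase = np.asarray(phase, float)
    m = n                                 # square matrix: frame length == frame count
    hop = max(1, len(phase) // n)         # frame shift derived from total length
    need = (n - 1) * hop + m              # samples required for n full frames
    if len(phase) < need:                 # pad by repeating the last value
        phase = np.pad(phase, (0, need - len(phase)), mode='edge')
    rows = [phase[i * hop: i * hop + m] for i in range(n)]
    return np.stack(rows)                 # shape (n, n)
```

Because the frame shift is derived from the sequence length, audio of different durations still produces a fixed n×n matrix, which is what lets one network input size serve all recordings.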
Step four: the ENF phase time sequence characterization is acquired by the following steps:
A. Calculating the ENF phase time sequence characterization:
(A-1) acquiring the longest-duration audio data among the audio data to be detected.
(A-2) for the longest-duration audio, obtaining the phase sequence φ̂ by DFT¹ transformation.
(A-3) setting the frame length m and calculating the number of frames n from the length of the longest phase sequence.
(A-4) for all audio data, calculating the frame shift overlap = m − floor(length(φ̂)/n).
(A-5) since the sequence may not divide evenly, splitting the frames into two parts: the first k frames keep the full frame length, while the remaining frames are one sample shorter, where k = length(φ̂) − (m − overlap)×n.
(A-6) arranging the framed phase data frame by frame to form the ENF phase time sequence characterization.
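The two-part framing of steps (A-4) and (A-5) can be sketched numerically; the exact boundary handling of the two frame groups is an interpretation of the split described above, not a verbatim reproduction of the patent's implementation:

```python
import numpy as np

def frame_timing(phase, m, n):
    """Sketch of the two-part framing for the ENF time sequence
    characterization: n frames of nominal length m; the first k frames
    keep length m and the remaining frames are one sample shorter."""
    phase = np.asarray(phase, float)
    overlap = m - len(phase) // n      # overlap = m - floor(length/n)
    hop = m - overlap                  # == floor(length/n)
    k = len(phase) - hop * n           # number of full-length frames
    frames, pos = [], 0
    for i in range(n):
        size = m if i < k else m - 1
        frames.append(phase[pos: pos + size])
        pos += hop
    return frames, k
```

For a 103-sample phase sequence with m = 13 and n = 10, the hop is 10 and k = 3, so the first three frames have 13 samples and the remaining seven have 12.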
Step five: the network model comprises the following steps:
A. Spatial information is obtained through a convolutional neural network. The feature matrix P_{n×n} is processed with two convolution blocks to obtain the ENF space information; each convolution block consists of two identical convolution layers and one pooling layer (the numbers of convolution kernels of the two blocks are 32 and 64, the kernel size is 3×3, the stride is 1, and the MaxPooling pool size is 3). The last pooling layer outputs the ENF space information.
B. Time sequence information is acquired through the Bi-LSTM network. Two bidirectional long short-term memory (Bi-LSTM) modules train the ENF phase time sequence characterization and output the state of each time step. Each Bi-LSTM module comprises a bidirectional LSTM layer, a LayerNormalization layer and a LeakyReLU activation function.
C. The attention mechanism fuses the space and time sequence information.
(C-1) The space and time sequence features are concatenated to obtain a feature vector of length L.
(C-2) The vector is then passed through three fully connected layers, all with ReLU activation but with L, L/8 and L neurons respectively. The purpose is to compress the features, following the SE (Squeeze-and-Excitation) network: the compression increases the nonlinearity of the network and yields more accurate weights. After this nonlinear operation, the weights of the space and time sequence features are obtained through one fully connected layer with L neurons and Sigmoid activation.
(C-3) The obtained weights are multiplied by the space and time sequence features from before the attention mechanism. Through automatic learning, different weights can be assigned to the space and time sequence features, and features with a large influence on the classification result receive larger weights, improving detection accuracy.
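The SE-style weighting of (C-1) to (C-3) can be sketched in plain numpy. The random weight matrices stand in for the trained fully connected layers and are purely illustrative; in the patent these weights are learned end to end:

```python
import numpy as np

def se_attention_fuse(spatial, timing, seed=0):
    """Numpy sketch of the SE-style attention fusion: concatenate the
    spatial and timing feature vectors (length L), pass them through
    dense layers sized L -> L/8 -> L with ReLU, then a Sigmoid dense
    layer producing per-feature weights that rescale the fused vector."""
    f = np.concatenate([spatial, timing])            # fused vector, length L
    L = f.size
    rng = np.random.default_rng(seed)
    relu = lambda z: np.maximum(z, 0.0)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = relu(rng.standard_normal((L, L)) @ f * 0.1)        # dense, L neurons
    h = relu(rng.standard_normal((L // 8, L)) @ h * 0.1)   # squeeze to L/8
    h = relu(rng.standard_normal((L, L // 8)) @ h * 0.1)   # expand back to L
    w = sigmoid(rng.standard_normal((L, L)) @ h * 0.1)     # weights in (0, 1)
    return w * f                                           # reweighted features
```

Because the Sigmoid keeps every weight strictly between 0 and 1, the mechanism can only attenuate features, which is how low-influence information is suppressed before fitting and classification.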
D. The fused space and time sequence features are fitted and classified. The features are fully fitted with two fully connected layers (1024 and 256 neurons respectively, ReLU activation). A Dropout layer (dropout rate 0.2) is added between the two fully connected layers to prevent overfitting. Finally, a fully connected layer (2 neurons, Softmax activation) serves as the output layer.
E. The probability obtained from the output layer indicates whether the speech under test has been tampered with; the proportion of all test speech whose tampering status is correctly recognized is the recognition rate of the system.
This embodiment also relates to a digital audio tampering passive detection device based on the fusion of power grid frequency space and time sequence characteristics, comprising:
a first module, configured to process the audio data to be detected to obtain the electric network frequency (ENF) component and to process the ENF component based on the DFT⁰ and DFT¹ transforms to obtain the ENF phases φ_DFT⁰ and φ̂_DFT¹;
a second module, configured to calculate the frame shifts corresponding to the ENF phases φ_DFT⁰ and φ̂_DFT¹ respectively and to frame them accordingly, to reshape the framing data obtained from the ENF phase φ_DFT⁰ into an ENF space feature matrix, and to split the framing data obtained from the ENF phase φ̂_DFT¹ into two parts to form the ENF time sequence characterization;
a third module, configured to acquire space information from the space feature matrix using a neural network, to acquire ENF time sequence information from the ENF phase time sequence characterization, and to fuse, fit and classify the space and time sequence information.
The specific implementation steps of the three modules adopt the steps from the first step to the fifth step, and are not repeated here.
The specific embodiments described herein are offered by way of example only to illustrate the spirit of the invention. Those skilled in the art may make various modifications or additions to the described embodiments or substitutions thereof without departing from the spirit of the invention or exceeding the scope of the invention as defined in the accompanying claims.
Claims (3)
1. A digital audio tampering passive detection method based on fusion of power grid frequency space and time sequence features, characterized by comprising the following steps:
processing the audio data to be detected to obtain the electric network frequency (ENF) component, and processing the ENF component based on the DFT¹ transform to obtain the ENF phases φ⁰ and φ¹;
determining the frame number and frame length of the spatial characterization and the temporal characterization according to the longest-duration audio to be detected, calculating the frame shifts corresponding to the ENF phases φ⁰ and φ¹ respectively, and framing the phases accordingly; the framed data obtained from the phase φ⁰ are reshaped into an ENF spatial feature matrix, and the framed data obtained from the phase φ¹ are split into two parts to form the ENF temporal characterization;
acquiring spatial information from the spatial feature matrix with a neural network, acquiring ENF temporal information from the ENF phase temporal characterization, and fusing, fitting and classifying the spatial and temporal information;
Acquiring the ENF phase includes:
Step 2.1: calculate the approximate first derivative of the ENF signal X_ENFC[n] at point n:
X′_ENFC[n] = f_d (X_ENFC[n] − X_ENFC[n−1])  (1)
where multiplication by f_d (the resampling frequency) turns the first difference into an approximate derivative, and X_ENFC[n] denotes the value of the nth point of the ENF component;
Step 2.2: frame and window X_ENFC[n] and X′_ENFC[n]; the frame length is 10 standard ENF periods and the frame shift is 1 standard ENF period; each frame is windowed with a Hanning window w(n):
X_N[n] = X_ENFC[n] w(n)  (2)
X′_N[n] = X′_ENFC[n] w(n)  (3)
where the Hanning window is w(n) = 0.5 − 0.5 cos(2πn/L) and L is the window length;
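Steps 2.1 and 2.2 can be sketched as follows; the resampling frequency f_d = 1000 Hz and nominal grid frequency 50 Hz are illustrative assumptions, not values fixed by the claim:

```python
import numpy as np

def frame_and_window(x_enfc, fd=1000.0, f_enf=50.0):
    """Approximate derivative (eq. 1), then framing and Hanning windowing.

    fd (resampling frequency) and f_enf (nominal grid frequency) are
    illustrative assumptions."""
    # Step 2.1: approximate first derivative X'[n] = fd * (X[n] - X[n-1])
    x_d = fd * (x_enfc[1:] - x_enfc[:-1])
    x = x_enfc[1:]                        # align the two sequences

    frame_len = int(10 * fd / f_enf)      # frame length: 10 standard ENF periods
    hop = int(fd / f_enf)                 # frame shift: 1 standard ENF period
    w = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(frame_len) / frame_len)  # Hanning

    n_frames = (len(x) - frame_len) // hop + 1
    frames = np.stack([x[i * hop:i * hop + frame_len] * w
                       for i in range(n_frames)])
    frames_d = np.stack([x_d[i * hop:i * hop + frame_len] * w
                         for i in range(n_frames)])
    return frames, frames_d
```

With fd = 1000 Hz and f_enf = 50 Hz this gives 200-sample frames with a 20-sample shift.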
Step 2.3: perform an N-point discrete Fourier transform (DFT) on each frame of the signals X_N[n] and X′_N[n] to obtain X(k) and X′(k);
Step 2.4: let k_peak be the index of the peak of |X(k)|; k_peak is used to solve the frequency estimate f_DFT = k_peak·f_d/N;
Step 2.5: from the estimated frequency f_DFT of the ENF signal, the ENF phase feature φ⁰ = arg[X(k_peak)] can be found;
Step 2.6: re-estimate the ENF phase φ¹ from the DFT¹ transform; let k_peak be the index of the peak of |X′(k)|, and multiply |X′(k)| by a scale factor F(k) = (πk/N)/sin(πk/N),
obtaining DFT⁰[k] = X(k) and DFT¹[k] = F(k)|X′(k)|; the estimated frequency value is therefore f_DFT¹ = DFT¹[k_peak]/(2π|DFT⁰[k_peak]|);
Step 2.7: k_peak should be the integer closest to N·f_DFT¹/f_d (f_d is the resampling frequency), so that f_DFT¹ is a reasonable frequency value; the ENF phase can then be represented as φ¹ = θ ± π/2,
where θ is the value of arg[X′(k)] at the point k = N·f_DFT¹/f_d, obtained by linear interpolation from X′(k); floor[a] denotes the largest integer smaller than a, and ceil[b] the smallest integer larger than b;
since arg[X′(k)] varies approximately linearly near the peak, linear interpolation between (k_low, θ_low) = (floor[N·f_DFT¹/f_d], arg[X′(k_low)]) and (k_high, θ_high) = (ceil[N·f_DFT¹/f_d], arg[X′(k_high)]) can approximate the point (N·f_DFT¹/f_d, θ), and the value obtained is consistent with the value of θ in the above formula;
Step 2.8: the φ¹ found in this way has two possible values; with φ⁰ as reference, the value closest to φ⁰ is selected as the final value of φ¹.
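A compact NumPy sketch of steps 2.3 to 2.8 follows. The scale factor F(k) = (πk/N)/sin(πk/N), the frequency formula and the θ ± π/2 candidate pair are assumptions based on the standard high-precision DFT⁰/DFT¹ phase analysis method; phase unwrapping is simplified for brevity:

```python
import numpy as np

def dft1_phase(frame, frame_d, fd=1000.0, N=2000):
    """Estimate ENF frequency and phase phi1 from a frame and its
    approximate-derivative frame (steps 2.3-2.8); N and fd are
    illustrative assumptions."""
    X = np.fft.rfft(frame, N)             # step 2.3: N-point DFTs
    Xd = np.fft.rfft(frame_d, N)
    k_peak = int(np.argmax(np.abs(X)))

    phi0 = np.angle(X[k_peak])            # step 2.5: DFT0 phase

    # Step 2.6: scale factor corrects the first-difference magnitude (assumed form)
    F = (np.pi * k_peak / N) / np.sin(np.pi * k_peak / N)
    f_hat = F * np.abs(Xd[k_peak]) / np.abs(X[k_peak]) / (2 * np.pi)

    # Step 2.7: interpolate arg X'(k) at the (generally non-integer) k*
    k_star = N * f_hat / fd
    k_low, k_high = int(np.floor(k_star)), int(np.ceil(k_star))
    if k_low == k_high:
        theta = np.angle(Xd[k_low])
    else:
        t_low, t_high = np.unwrap([np.angle(Xd[k_low]), np.angle(Xd[k_high])])
        theta = t_low + (k_star - k_low) * (t_high - t_low)

    # Step 2.8: two candidates theta +/- pi/2; pick the one closest to phi0
    cands = np.angle(np.exp(1j * np.array([theta - np.pi / 2,
                                           theta + np.pi / 2])))
    phi1 = cands[np.argmin(np.abs(np.angle(np.exp(1j * (cands - phi0)))))]
    return f_hat, phi1
```

A small bias remains in φ¹ from the half-sample lag of the first-difference approximation; the patent's exact compensation is not reproduced here.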
The specific method for calculating the ENF spatial feature matrix comprises:
step 3.1: acquire the audio data with the longest duration among the audio data to be detected;
step 3.2: for the longest-duration audio, obtain the phase φ by DFT conversion;
step 3.3: calculate the length of this longest phase sequence;
step 3.4: calculate the frame length m from that length, where m is the frame length of the frequency feature matrix;
step 3.5: calculate the phases φ of all the audio data;
step 3.6: calculate the frame shift and perform framing;
step 3.7: reshape the framed phase and frequency to obtain the feature matrix P_n×n;
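Steps 3.5 to 3.7 can be sketched as follows; the matrix side n = 45 and the even-overlap rule are illustrative assumptions, since the patent derives m and the frame shift from the longest-duration audio:

```python
import numpy as np

def enf_spatial_matrix(phi, n=45):
    """Frame the ENF phase sequence so that n frames of length n are
    obtained, then stack (Reshape) them into the n x n spatial feature
    matrix P. Assumes len(phi) >= n; the overlap rule is an assumption."""
    hop = max(1, (len(phi) - n) // (n - 1))     # frame shift so n frames fit
    rows = [phi[i * hop:i * hop + n] for i in range(n)]
    return np.stack(rows)                       # feature matrix P (n x n)
```

Overlapping rows turn a one-dimensional phase sequence into a two-dimensional matrix that a CNN can treat spatially.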
The specific method for calculating the ENF temporal characterization comprises:
step 4.1: acquire the audio data with the longest duration among the audio data to be detected;
step 4.2: for the longest-duration audio, obtain the phase φ by DFT conversion;
step 4.3: set the frame length m and calculate the number of frames n accordingly;
step 4.4: for all audio data, calculate the frame-shift overlap, overlap = m − floor(length(φ)/n);
step 4.5: because length(φ) generally cannot be divided into frames exactly, the framing is split into two parts whose frame shifts differ by 1, with the remainder k = length(φ) − (m − overlap)·n;
step 4.6: the framed data form the ENF phase temporal characterization;
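Steps 4.3 to 4.6 can be sketched as follows. The frame shift m − overlap = floor(len(φ)/n) follows the claim; because the exact two-part split rule is garbled in the source, distributing the remainder k over the first k frames (and zero-padding short tail frames) is an assumption, as are m = 90 and n = 40:

```python
import numpy as np

def enf_temporal_frames(phi, m=90, n=40):
    """Frame the phase sequence into n frames of length m with a two-part
    frame shift absorbing the remainder k (split rule assumed)."""
    hop = len(phi) // n                  # frame shift = m - overlap
    k = len(phi) - hop * n               # remainder that cannot be framed evenly
    frames, start = [], 0
    for i in range(n):
        seg = phi[start:start + m]
        if len(seg) < m:                 # zero-pad tail frames if needed
            seg = np.pad(seg, (0, m - len(seg)))
        frames.append(seg)
        start += hop + (1 if i < k else 0)   # first k frames shift one more
    return np.stack(frames)              # ENF phase temporal characterization
```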
The network model part comprises:
step 5.1: acquire spatial information through a convolutional neural network; the feature matrix P_n×n is processed by two convolution blocks to obtain the ENF spatial information, where each convolution block consists of two identical convolution layers and one pooling layer, the numbers of convolution kernels of the two blocks are 32 and 64, the convolution kernel size is 3×3 with stride 1, and the max-pooling size is 3; the last pooling layer outputs the ENF spatial information;
step 5.2: acquire temporal information through a Bi-LSTM network; two bidirectional long short-term memory (Bi-LSTM) modules are trained on the ENF phase temporal characterization and output the state of each time step; each Bi-LSTM module comprises a bidirectional LSTM layer, a LayerNormalization layer and a LeakyReLU activation function;
step 5.3: concatenate the spatial and temporal features to obtain a feature vector of length L;
step 5.4: pass this vector through three fully connected layers, all with ReLU activations but with L, L/8 and L neurons respectively; the purpose is to compress the features, following the Squeeze-and-Excitation network, whose bottleneck increases the nonlinearity of the network so that more accurate weights can be obtained; after this nonlinear operation, the weights of the spatial and temporal features are obtained through one fully connected layer with L neurons and a Sigmoid activation function;
step 5.5: multiply the obtained weights with the spatial and temporal features from before the attention mechanism to weight them; through automatic learning, different weights can be assigned to the spatial and temporal features, and features with a large influence on the classification result receive larger weights, improving the detection precision;
step 5.6: fit and classify the fused spatial and temporal features; two fully connected layers with 1024 and 256 neurons and ReLU activations fit the features fully, a Dropout layer (dropout rate = 0.2) between the two layers prevents overfitting, and finally a fully connected layer with 2 neurons and a Softmax activation function serves as the output layer;
step 5.7: the output-layer probabilities finally indicate whether the speech under test has been tampered with; the proportion of all test utterances whose tampering status is correctly recognized is the recognition rate of the system.
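The Squeeze-and-Excitation-style attention fusion described above (concatenation, L → L/8 → L bottleneck with ReLU, Sigmoid weight layer, element-wise reweighting) can be sketched in NumPy; the random matrices stand in for the trained parameters and are purely illustrative:

```python
import numpy as np

def se_attention_fuse(spatial_feat, temporal_feat, rng=None):
    """Concatenate spatial and temporal features, compute per-feature
    attention weights through an SE-style bottleneck, and reweight.
    Random weights replace trained ones (illustration only)."""
    rng = rng or np.random.default_rng(0)
    z = np.concatenate([spatial_feat, temporal_feat])   # length-L fused vector
    L = z.shape[0]
    relu = lambda a: np.maximum(a, 0.0)
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

    W1 = rng.normal(size=(L, L)) * 0.1        # FC, L neurons, ReLU
    W2 = rng.normal(size=(L // 8, L)) * 0.1   # FC, L/8 neurons, ReLU (squeeze)
    W3 = rng.normal(size=(L, L // 8)) * 0.1   # FC, L neurons, ReLU (excite)
    W4 = rng.normal(size=(L, L)) * 0.1        # FC, L neurons, Sigmoid

    h = relu(W3 @ relu(W2 @ relu(W1 @ z)))
    w = sigmoid(W4 @ h)                        # per-feature attention weights
    return w * z                               # weighted fused features
```

The Sigmoid keeps every weight in (0, 1), so each feature is attenuated in proportion to its learned importance before the final classification layers.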
2. The digital audio tampering passive detection method based on fusion of power grid frequency space and time sequence features according to claim 1, wherein the original voice signal is processed to obtain the electric network frequency (ENF) component, specifically comprising:
down-sampling: the signal resampling frequency is set to 1000 Hz or 1200 Hz;
narrow-band filtering: a 10000-order linear zero-phase FIR filter is used, with its center frequency at the ENF standard frequency, a bandwidth of 0.6 Hz, a passband ripple of 0.5 dB and a stopband attenuation of 100 dB.
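The ENF extraction of claim 2 can be sketched as follows; for brevity, linear-interpolation resampling and an ideal FFT-domain band-pass replace the 10000-order zero-phase FIR filter of the claim, and f_enf = 50 Hz is an illustrative assumption:

```python
import numpy as np

def extract_enf(audio, fs, fd=1000.0, f_enf=50.0, bw=0.6):
    """Resample to fd and band-pass a 0.6 Hz band around the nominal grid
    frequency (ideal FFT-domain filter standing in for the FIR filter)."""
    # naive resampling by linear interpolation (illustrative only)
    t_new = np.arange(0, len(audio) / fs, 1.0 / fd)
    x = np.interp(t_new, np.arange(len(audio)) / fs, audio)

    # ideal zero-phase band-pass around f_enf via the FFT
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fd)
    X[np.abs(freqs - f_enf) > bw / 2] = 0.0
    return np.fft.irfft(X, n=len(x))
```

An FFT-domain mask is zero-phase by construction, mirroring the zero-phase requirement of the claimed FIR design.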
3. A digital audio tampering passive detection device based on fusion of power grid frequency space and time sequence features, adopting the method of any one of claims 1 to 2, comprising:
a first module, configured to process the audio data to be detected to obtain the electric network frequency (ENF) component, and to process the ENF component based on the DFT¹ transform to obtain the ENF phases φ⁰ and φ¹;
a second module, configured to calculate the frame shifts corresponding to the ENF phases φ⁰ and φ¹ respectively and to frame the phases accordingly; the framed data obtained from the phase φ⁰ are reshaped into the ENF spatial feature matrix, and the framed data obtained from the phase φ¹ are split into two parts to form the ENF temporal characterization;
a third module, configured to acquire spatial information from the spatial feature matrix with a neural network, acquire ENF temporal information from the ENF phase temporal characterization, and fuse, fit and classify the spatial and temporal information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210450835.1A CN114722964B (en) | 2022-04-26 | 2022-04-26 | Digital audio tampering passive detection method and device based on fusion of power grid frequency space and time sequence characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114722964A CN114722964A (en) | 2022-07-08 |
CN114722964B true CN114722964B (en) | 2024-08-02 |
Family
ID=82245886
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210450835.1A Active CN114722964B (en) | 2022-04-26 | 2022-04-26 | Digital audio tampering passive detection method and device based on fusion of power grid frequency space and time sequence characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114722964B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118368483B (en) * | 2024-06-19 | 2024-09-06 | 华侨大学 | Method, device, equipment and medium for detecting video inter-frame tampering in power grid environment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109284717A (en) * | 2018-09-25 | 2019-01-29 | 华中师范大学 | It is a kind of to paste the detection method and system for distorting operation towards digital audio duplication |
CN112151067A (en) * | 2020-09-27 | 2020-12-29 | 湖北工业大学 | Passive detection method for digital audio tampering based on convolutional neural network |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11069370B2 (en) * | 2016-01-11 | 2021-07-20 | University Of Tennessee Research Foundation | Tampering detection and location identification of digital audio recordings |
CN112489677B (en) * | 2020-11-20 | 2023-09-22 | 平安科技(深圳)有限公司 | Voice endpoint detection method, device, equipment and medium based on neural network |
Also Published As
Publication number | Publication date |
---|---|
CN114722964A (en) | 2022-07-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110728360B (en) | Micro-energy device energy identification method based on BP neural network | |
CN112885372B (en) | Intelligent diagnosis method, system, terminal and medium for power equipment fault sound | |
CN112151067B (en) | Digital audio tampering passive detection method based on convolutional neural network | |
CN111986699B (en) | Sound event detection method based on full convolution network | |
CN112446242A (en) | Acoustic scene classification method and device and corresponding equipment | |
CN113707175B (en) | Acoustic event detection system based on feature decomposition classifier and adaptive post-processing | |
WO2023226355A1 (en) | Dual-ion battery fault detection method and system based on multi-source perception | |
CN114722964B (en) | Digital audio tampering passive detection method and device based on fusion of power grid frequency space and time sequence characteristics | |
CN109658943A (en) | A kind of detection method of audio-frequency noise, device, storage medium and mobile terminal | |
CN112529177A (en) | Vehicle collision detection method and device | |
CN115393968A (en) | Audio-visual event positioning method fusing self-supervision multi-mode features | |
Zhang et al. | Temporal Transformer Networks for Acoustic Scene Classification. | |
CN116741159A (en) | Audio classification and model training method and device, electronic equipment and storage medium | |
CN114067829A (en) | Reactor fault diagnosis method and device, computer equipment and storage medium | |
CN115270906A (en) | Passive digital audio tampering detection method and device based on power grid frequency depth layer feature fusion | |
CN116584956A (en) | Single-channel electroencephalogram sleepiness detection method based on lightweight neural network | |
Čavor et al. | Vehicle speed estimation from audio signals using 1d convolutional neural networks | |
CN114822590B (en) | Digital audio tampering passive detection method and device based on power grid frequency phase timing sequence characterization | |
CN113177536B (en) | Vehicle collision detection method and device based on deep residual shrinkage network | |
CN114121025A (en) | Voiceprint fault intelligent detection method and device for substation equipment | |
PVSMS | A deep learning based system to predict the noise (disturbance) in audio files | |
Luo et al. | Sound-Convolutional Recurrent Neural Networks for Vehicle Classification Based on Vehicle Acoustic Signals | |
CN111048203B (en) | Brain blood flow regulator evaluation device | |
CN118427670B (en) | Online data monitoring method and system for battery exchange cabinet | |
CN118503893B (en) | Time sequence data anomaly detection method and device based on space-time characteristic representation difference |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||