
CN114722964B - Digital audio tampering passive detection method and device based on fusion of power grid frequency space and time sequence characteristics - Google Patents

Digital audio tampering passive detection method and device based on fusion of power grid frequency space and time sequence characteristics

Info

Publication number
CN114722964B
CN114722964B (granted publication of application CN202210450835.1A)
Authority
CN
China
Prior art keywords
enf
time sequence
phase
space
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210450835.1A
Other languages
Chinese (zh)
Other versions
CN114722964A
Inventor
曾春艳
杨尧
王志锋
武明虎
冯世雄
孔帅
余琰
夏诗言
李坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei University of Technology
Original Assignee
Hubei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei University of Technology filed Critical Hubei University of Technology
Priority to CN202210450835.1A
Publication of CN114722964A
Application granted
Publication of CN114722964B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S40/00 Systems for electrical power generation, transmission, distribution or end-user application management characterised by the use of communication or information technologies, or communication or information technology specific aspects supporting them
    • Y04S40/20 Information technology specific aspects, e.g. CAD, simulation, modelling, system security

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to a passive digital audio tampering detection method and device based on the fusion of power grid frequency spatial and temporal features. The audio data to be detected are first processed to obtain the electric network frequency (ENF) component, from which two ENF phase sequences, φ⁰ (the DFT⁰ estimate) and φ¹ (the DFT¹ estimate), are extracted. The frame count and frame length of the spatial and temporal representations are determined from the longest audio to be detected; the frame shifts for φ⁰ and φ¹ are computed and each phase sequence is framed accordingly. The framed φ⁰ data are reshaped into an ENF space feature matrix, and the framed φ¹ data are split into two parts to form the ENF temporal representation. A neural network then acquires spatial information from the space feature matrix and ENF temporal information from the ENF phase temporal representation, and the spatial and temporal information are fused, fitted and classified. By fusing spatial and temporal features, the invention describes the ENF variation in the audio comprehensively and thereby improves detection accuracy.

Description

Digital audio tampering passive detection method and device based on fusion of power grid frequency space and time sequence characteristics
Technical Field
The invention belongs to the technical field of digital audio tamper detection, and particularly relates to a digital audio tamper passive detection method and device based on fusion of power grid frequency space and time sequence characteristics.
Background
With the rapid progress of digital audio technology, digital audio signals can be collected conveniently, but they can also be edited and modified afterwards with a variety of audio processing software. If intentionally or unintentionally tampered digital audio is used in important settings such as judicial evidence collection, serious social problems may result, so research on digital audio tamper detection is of great significance.
Passive detection of digital audio tampering is a technique that analyses and judges the authenticity and integrity of digital audio from the characteristics of the audio itself, without any added side information, which makes it practical for complex forensic environments. When a recording device is powered from the electric grid, an electric network frequency (Electric Network Frequency, ENF) signal remains in the recorded audio file. When the digital audio is tampered with, the ENF signal changes along with the tampering operation, so the uniqueness and stability of the ENF signal suggest two research directions for passive tampering detection. The first compares the ENF signal extracted from the audio with the ENF database of the power supply authority; this is difficult and costly to implement. The second extracts certain features of the ENF signal and analyses their consistency and regularity. Current research on audio tampering forensics with ENF signals mainly applies traditional machine-learning methods to classify ENF features such as phase change, phase discontinuity and abrupt changes in instantaneous frequency, so as to achieve tampering detection.
In existing digital audio detection methods, either a threshold is set on the corresponding feature, or a machine-learning classifier is applied. These methods often rely too heavily on empirical tuning, or are so narrowly tailored to one tampering method that their discriminative power is insufficient.
In recent years, with improvements in machine-learning algorithms and in computer storage and computing power, deep neural networks (Deep Neural Network, DNN) have been applied to the field of audio tamper detection. The deep nonlinear transformations of a DNN fit the audio tampering characteristics better, enable automatic learning and detection, and offer a high recognition rate. To address the single feature information of existing methods and their failure to make full use of power grid frequency information, the invention provides an audio tampering detection method based on the fusion of power grid frequency spatial and temporal features. The method first takes the ENF phase feature matrix as the spatial feature and acquires ENF spatial information with a convolutional neural network; it takes the ENF phase temporal representation as the temporal feature and acquires ENF temporal information with a Bi-LSTM network; it then fuses the spatial and temporal information through an attention mechanism, and finally classifies real audio and tampered audio with a DNN classifier.
Disclosure of Invention
The technical problem of the invention is mainly solved by the following technical solution:
a digital audio tampering passive detection method based on fusion of power grid frequency space and time sequence features is characterized by comprising the following steps of
Processing the audio data to be detected to obtain an Electric Network Frequency (ENF) component, and processing the ENF component based on DFT 1 conversion to obtain an ENF phaseAnd
Determining the frame number and frame length of the space representation and the time sequence representation according to the audio frequency of the longest duration to be detected, and respectively calculating the ENF phaseAndFrame shift corresponding to each other, and frame-dividing the frame shift corresponding to each other, by ENF phaseCarrying out Reshape on the obtained framing data to obtain an ENF space feature matrix, and carrying out phase adjustment by the ENFThe obtained framing data is split into two parts to form ENF time sequence characterization;
And acquiring space information from the space feature matrix by utilizing a neural network, acquiring ENF time sequence information from the ENF phase time sequence characterization, and performing fusion, fitting and classification on the space information and the time sequence information.
In the above method for passively detecting digital audio tampering based on fusion of power grid frequency space and time sequence features, processing the original voice signal to obtain the electric network frequency (ENF) component specifically comprises:
downsampling, with the signal resampling frequency set to 1000 Hz or 1200 Hz;
narrow-band filtering with a 10000-order linear zero-phase FIR filter, whose center frequency is the ENF nominal value, bandwidth 0.6 Hz, passband ripple 0.5 dB and stopband attenuation 100 dB.
In the above method for passive detection of digital audio tampering based on fusion of power grid frequency space and time sequence features, acquiring the ENF phase comprises:
Step 2.1, calculating the approximate first derivative of the ENF signal X_ENFC[n] at point n:
X′_ENFC[n] = f_d·(X_ENFC[n] − X_ENFC[n−1])   (1)
where f_d is the resampling frequency, which scales the first difference into an approximate derivative, and X_ENFC[n] is the value of the nth sample of the ENF component;
Step 2.2, framing and windowing X_ENFC[n] and X′_ENFC[n], with a frame length of 10 standard ENF periods and a frame shift of 1 standard ENF period, and windowing X_ENFC[n] and X′_ENFC[n] with a Hanning window w(n):
X_N[n] = X_ENFC[n]·w(n)   (2)
X′_N[n] = X′_ENFC[n]·w(n)   (3)
where the Hanning window is w(n) = 0.5·(1 − cos(2πn/(L−1))) and L is the window length;
Step 2.3, performing an N_DFT-point discrete Fourier transform (DFT) on each frame X_N[n] and X′_N[n] to obtain X(k) and X′(k);
Step 2.4, letting k_peak be the index of the peak of |X(k)|; from k_peak, the DFT⁰ frequency estimate f_DFT⁰ = k_peak·f_d/N_DFT is obtained;
Step 2.5, from the estimated frequency f_DFT⁰ of the ENF signal, the DFT⁰ phase estimate φ⁰ = arg[X(k_peak)] can be found;
Step 2.6, re-estimating the ENF phase with the DFT¹ transform: letting k_peak be the index of the peak of |X′(k)| and multiplying |X′(k)| by a scale factor F(k) that compensates the first-difference approximation of the derivative, obtaining DFT⁰[k] = X(k) and DFT¹[k] = F(k)·|X′(k)|; the estimated frequency is then f_DFT¹ = DFT¹[k_peak]/(2π·|DFT⁰[k_peak]|);
Step 2.7, k_peak should be the integer closest to f_DFT¹·N_DFT/f_d (f_d being the resampling frequency), so f_DFT¹ is a reasonable frequency value; the DFT¹ phase φ¹ can be expressed through an auxiliary angle θ obtained by linear interpolation on arg[X′(k)], with k_low = Floor[f_DFT¹·N_DFT/f_d] and k_high = Ceil[f_DFT¹·N_DFT/f_d], where Floor[a] denotes the largest integer not exceeding a and Ceil[b] the smallest integer not less than b;
since k_low ≤ f_DFT¹·N_DFT/f_d ≤ k_high, linear interpolation between the points (k_low, θ_low) = (k_low, arg[X′(k_low)]) and (k_high, θ_high) = (k_high, arg[X′(k_high)]) approximates θ at the point f_DFT¹·N_DFT/f_d, consistent with the value of θ above;
Step 2.8, the resulting φ¹ admits two possible values; taking φ⁰ as reference, the candidate closest to φ⁰ is selected as the final value of φ¹.
In the above method for passive detection of digital audio tampering based on fusion of power grid frequency space and time sequence features, the specific method for calculating the ENF space feature matrix comprises the following steps:
step 3.1, acquiring the audio data with the longest duration among the audio data to be detected;
step 3.2, for this longest audio, obtaining its phase sequence by the DFT transform;
step 3.3, computing the length of this longest phase sequence;
step 3.4, computing the frame length m of the feature matrix from that length, where m is the frame length of the frequency feature matrix and is fixed for all recordings;
step 3.5, computing the phase sequences of all the audio data;
step 3.6, computing the frame shift and framing the data, the frame shift being chosen for each recording so that every phase sequence is divided into the same number of frames;
step 3.7, performing Reshape on the framed phase (and frequency) data to obtain the feature matrix P_{n×n}.
In the above method for passively detecting digital audio tampering based on the fusion of power grid frequency space and time sequence features, the specific method for calculating the ENF temporal representation comprises the following steps:
Step 4.1, acquiring the audio data with the longest duration among the audio data to be detected;
Step 4.2, for this longest audio, obtaining its phase sequence by the DFT transform;
Step 4.3, setting the frame length m and computing the frame count n from the length of the longest phase sequence;
Step 4.4, for all audio data, computing the frame overlap overlap = m − Floor[length(Φ)/n];
Step 4.5, because the sequence may not divide evenly into n frames, splitting the framing into two parts: k of the n frames use a frame shift that differs by one sample from that of the others, where k = length(Φ) − (m − overlap)×n;
Step 4.6, stacking the n frames to give the ENF phase temporal representation.
In the above method for passive detection of digital audio tampering based on fusion of power grid frequency space and time sequence features, the network model part comprises:
Step 4.1, acquiring spatial information through a convolutional neural network: the feature matrix P_{n×n} is processed with two convolution blocks to obtain the ENF spatial information, each convolution block consisting of two identical convolution layers and one pooling layer (the two blocks have 32 and 64 convolution kernels respectively, kernel size 3×3, stride 1, and MaxPooling pool size 3); the last pooling layer outputs the ENF spatial information;
Step 4.2, acquiring temporal information through a Bi-LSTM network: two bidirectional long short-term memory (Bi-LSTM) modules are trained on the ENF phase temporal representation, outputting the state of every time step; each Bi-LSTM module comprises a bidirectional LSTM layer, a LayerNormalization layer and a LeakyReLU activation;
Step 4.3, concatenating the spatial and temporal features to obtain a feature vector of length L;
Step 4.4, passing this vector through three fully connected layers, each with ReLU activation but with L, L/8 and L neurons respectively; this squeezes the features in the manner of an SE (Squeeze-and-Excitation) network, the compression increasing the nonlinearity of the network and yielding more accurate weights; after this nonlinear mapping, the weights of the spatial and temporal features are obtained through one fully connected layer with L neurons and Sigmoid activation;
Step 4.5, multiplying the obtained weights by the spatial and temporal features from before the attention mechanism; through automatic learning, the spatial and temporal features receive different weights, and features with a large influence on the classification result receive larger weights, so as to improve detection accuracy;
Step 4.6, fitting and classifying the fused spatial and temporal features: two fully connected layers (1024 and 256 neurons, ReLU activation) fully fit the features, a Dropout layer (dropout rate 0.2) between them prevents overfitting, and finally a fully connected layer (2 neurons, Softmax activation) serves as the output layer;
Step 4.7, the probabilities from the output layer indicate whether the speech under test has been tampered with; the proportion of test utterances whose tampering status is correctly recognized is the recognition rate of the system.
A digital audio tampering passive detection device based on fusion of power grid frequency space and time sequence characteristics is characterized by comprising:
a first module, configured to process the audio data to be detected to obtain an electric network frequency (ENF) component, and to process the ENF component based on the DFT¹ transform to obtain the ENF phases φ⁰ and φ¹;
a second module, configured to compute the frame shifts for φ⁰ and φ¹ respectively and frame each phase sequence accordingly, to perform Reshape on the framed φ⁰ data to obtain the ENF space feature matrix, and to split the framed φ¹ data into two parts to form the ENF temporal representation;
a third module, configured to acquire spatial information from the space feature matrix with a neural network, acquire ENF temporal information from the ENF phase temporal representation, and fuse, fit and classify the spatial and temporal information.
Therefore, the invention has the following advantages. The invention provides a deep-learning method that classifies fused ENF spatial and temporal features. To address the weak feature expression of traditional methods and their under-use of ENF temporal information, the spatial-temporal feature fusion overcomes the single-feature limitation of traditional audio tampering detection, describes the ENF variation in the audio more comprehensively, and improves detection accuracy. An attention mechanism fuses the features, automatically learning to take useful information from the audio ENF for the tampering-detection classification task and reducing the interference of irrelevant information on the final result. Compared with traditional digital audio tamper detection methods, the invention effectively improves the recognition performance of the system, improves the generalization ability of the model, optimizes the system structure, and raises the competitiveness of corresponding device-source identification products.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
Fig. 2 is a structural diagram of a neural network.
Detailed Description
The technical scheme of the invention is further specifically described below through examples and with reference to the accompanying drawings.
Examples:
The invention relates to a digital audio tampering passive detection method based on fusion of power grid frequency space and time sequence characteristics; the algorithm flow chart is shown in figure 1 and can be divided into five parts: 1) acquiring the ENF component; 2) extracting ENF phase features; 3) acquiring the ENF space feature matrix; 4) acquiring the ENF temporal representation; 5) training the neural network.
Step one: the ENF component is obtained by the following steps:
A. downsampling the audio, with a resampling frequency of 1000 Hz or 1200 Hz;
B. narrow-band filtering with a 10000-order linear zero-phase FIR filter, whose center frequency is the ENF nominal value (50 Hz or 60 Hz), bandwidth 0.6 Hz, passband ripple 0.5 dB and stopband attenuation 100 dB;
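As an illustrative sketch only (not the patented implementation), steps A and B can be approximated with SciPy; the function name, the default parameters, and the use of `filtfilt` to realize the zero-phase response are assumptions:

```python
import numpy as np
from math import gcd
from scipy.signal import resample_poly, firwin, filtfilt

def extract_enf_component(audio, fs, f_enf=50.0, fs_ds=1000, order=10000, bw=0.6):
    """Sketch of ENF-component extraction: downsample, then narrow-band filter."""
    g = gcd(int(fs_ds), int(fs))
    x = resample_poly(audio, fs_ds // g, fs // g)      # A. resample to 1000 Hz
    # B. linear-phase FIR bandpass of width bw centred on the nominal ENF value;
    # applying it forward and backward (filtfilt) gives a zero-phase result.
    taps = firwin(order + 1, [f_enf - bw / 2, f_enf + bw / 2],
                  pass_zero=False, fs=fs_ds)
    return filtfilt(taps, 1.0, x)
```

A much lower order than 10000 already isolates the tone in short test signals; the full order quoted in the text needs correspondingly long recordings.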
step two: the ENF phase feature extraction comprises the following steps:
A. Calculating the first derivative of the signal, framing and windowing, discrete Fourier transform, linear interpolation and phase estimation, and computing the phase fluctuation characteristics:
(A-1) Calculate the approximate first derivative of the ENF signal X_ENFC[n] at point n:
X′_ENFC[n] = f_d·(X_ENFC[n] − X_ENFC[n−1])   (1)
where f_d is the resampling frequency, which scales the first difference into an approximate derivative, and X_ENFC[n] is the value of the nth sample of the ENF component.
(A-2) Frame and window X_ENFC[n] and X′_ENFC[n]; the frame length is 10 standard ENF periods and the frame shift is 1 standard ENF period. Window X_ENFC[n] and X′_ENFC[n] with a Hanning window w(n):
X_N[n] = X_ENFC[n]·w(n)   (2)
X′_N[n] = X′_ENFC[n]·w(n)   (3)
where the Hanning window is w(n) = 0.5·(1 − cos(2πn/(L−1))) and L is the window length.
(A-3) Perform an N_DFT-point discrete Fourier transform (DFT) on each frame X_N[n] and X′_N[n] to obtain X(k) and X′(k), respectively.
(A-4) Let k_peak be the index of the peak of |X(k)|; from k_peak, the DFT⁰ frequency estimate is f_DFT⁰ = k_peak·f_d/N_DFT.
(A-5) From the estimated frequency f_DFT⁰ of the ENF signal, the DFT⁰ phase estimate is φ⁰ = arg[X(k_peak)].
(A-6) Re-estimate the ENF phase with the DFT¹ transform: let k_peak be the index of the peak of |X′(k)|, and multiply |X′(k)| by a scale factor F(k) that compensates the first-difference approximation of the derivative, giving DFT⁰[k] = X(k) and DFT¹[k] = F(k)·|X′(k)|. The estimated frequency is then f_DFT¹ = DFT¹[k_peak]/(2π·|DFT⁰[k_peak]|).
(A-7) k_peak should be the integer closest to f_DFT¹·N_DFT/f_d (f_d being the resampling frequency), so f_DFT¹ is a reasonable frequency value. The DFT¹ phase φ¹ can be expressed through an auxiliary angle θ obtained by linear interpolation on arg[X′(k)], with k_low = Floor[f_DFT¹·N_DFT/f_d] and k_high = Ceil[f_DFT¹·N_DFT/f_d], where Floor[a] denotes the largest integer not exceeding a and Ceil[b] the smallest integer not less than b.
Since k_low ≤ f_DFT¹·N_DFT/f_d ≤ k_high, linear interpolation between the points (k_low, θ_low) = (k_low, arg[X′(k_low)]) and (k_high, θ_high) = (k_high, arg[X′(k_high)]) approximates θ at the point f_DFT¹·N_DFT/f_d, consistent with the value of θ above.
(A-8) The resulting φ¹ admits two possible values; taking φ⁰ as reference, the candidate closest to φ⁰ is selected as the final value of φ¹.
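A minimal numerical sketch of the DFT⁰ part of this procedure (framing with a Hanning window and reading the phase at the spectral peak); the function name and the zero-padded DFT length are assumptions, and the DFT¹ refinement of (A-6) to (A-8) is omitted:

```python
import numpy as np

def enf_phase_dft0(x_enfc, fs=1000, f_nominal=50.0, n_dft=8192):
    """Frame-wise DFT0 phase estimate of an ENF component (illustrative sketch)."""
    period = int(round(fs / f_nominal))          # samples per nominal ENF cycle
    frame_len, hop = 10 * period, period         # 10-period frames, 1-period shift
    w = np.hanning(frame_len)                    # Hanning window w(n)
    phases = []
    for start in range(0, len(x_enfc) - frame_len + 1, hop):
        frame = x_enfc[start:start + frame_len] * w
        spec = np.fft.rfft(frame, n_dft)         # zero-padded N_DFT-point DFT
        k_peak = int(np.argmax(np.abs(spec)))    # index of the |X(k)| peak
        phases.append(np.angle(spec[k_peak]))    # phi0 = arg[X(k_peak)]
    return np.array(phases)
```

For an untampered tone whose frequency matches the nominal ENF, successive frame phases are essentially constant; tampering operations such as deletion or insertion break this consistency.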
Step three: the ENF space feature matrix is acquired by the following steps:
A. Calculating ENF space feature matrix
(A-1) Acquire the audio data with the longest duration among the audio data to be detected.
(A-2) For this longest audio, obtain its phase sequence by the DFT transform.
(A-3) Compute the length of this longest phase sequence.
(A-4) Compute the frame length m of the feature matrix from that length, where m is the frame length of the frequency feature matrix and is fixed for all recordings.
(A-5) Compute the phase sequences of all the audio data.
(A-6) Compute the frame shift and frame the data; for each recording, the frame shift is chosen so that every phase sequence is divided into the same number of frames.
(A-7) Perform Reshape on the framed phase (and frequency) data to obtain the feature matrix P_{n×n}.
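The framing-and-Reshape of this step can be sketched as follows; the helper name, the padding of short tail frames and the choice m = n for a square matrix are illustrative assumptions:

```python
import numpy as np

def enf_spatial_matrix(phase, n, m=None):
    """Frame a phase sequence and Reshape it into an (n, m) feature matrix;
    with m = n this is the square matrix P_{n x n} fed to the CNN."""
    m = n if m is None else m
    hop = max(1, (len(phase) - m) // max(1, n - 1))    # shift so n frames span the data
    frames = [phase[i * hop : i * hop + m] for i in range(n)]
    frames = [np.pad(f, (0, m - len(f))) for f in frames]  # pad any short tail frame
    return np.stack(frames)
```

Because the frame shift adapts to each recording's length while m and n stay fixed, every audio yields a matrix of identical size, as the fixed-size CNN input requires.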
Step four: the ENF phase sequence characterization acquisition comprises the following steps:
A. Calculating ENF phase timing characterization
(A-1) Acquire the audio data with the longest duration among the audio data to be detected.
(A-2) For this longest audio, obtain its phase sequence by the DFT transform.
(A-3) Set the frame length m and compute the frame count n from the length of the longest phase sequence.
(A-4) For all audio data, compute the frame overlap overlap = m − Floor[length(Φ)/n].
(A-5) Because the sequence may not divide evenly into n frames, split the framing into two parts: k of the n frames use a frame shift that differs by one sample from that of the others, where k = length(Φ) − (m − overlap)×n.
(A-6) Stack the n frames to give the ENF phase temporal representation.
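A hedged sketch of the two-part framing of (A-4) and (A-5); the function name, the padding of a short final frame, and the exact placement of the one-sample shift difference are assumptions:

```python
import numpy as np

def enf_temporal_frames(phase, m, n):
    """Split a phase sequence into n frames of length m for the Bi-LSTM input;
    k frames take a one-sample-larger shift so the frames tile the sequence."""
    hop = len(phase) // n                    # nominal frame shift m - overlap
    k = len(phase) - hop * n                 # leftover samples to absorb
    frames, pos = [], 0
    for i in range(n):
        frame = phase[pos:pos + m]
        if len(frame) < m:                   # pad the final frame if it runs short
            frame = np.pad(frame, (0, m - len(frame)))
        frames.append(frame)
        pos += hop + 1 if i < k else hop     # k frames use the larger shift
    return np.stack(frames)                  # shape (n, m)
```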
Step five: the network model comprises the following steps:
A. Obtain spatial information through a convolutional neural network. The feature matrix P_{n×n} is processed with two convolution blocks to obtain the ENF spatial information; each convolution block consists of two identical convolution layers and one pooling layer (the two blocks have 32 and 64 convolution kernels respectively, kernel size 3×3, stride 1, and MaxPooling pool size 3). The last pooling layer outputs the ENF spatial information.
B. Obtain temporal information through a Bi-LSTM network. Two bidirectional long short-term memory (Bi-LSTM) modules are trained on the ENF phase temporal representation, outputting the state of every time step. Each Bi-LSTM module comprises a bidirectional LSTM layer, a LayerNormalization layer and a LeakyReLU activation.
C. The attention mechanism fuses the spatial and temporal information.
(C-1) Concatenate the spatial and temporal features to obtain a feature vector of length L.
(C-2) Pass this vector through three fully connected layers, each with ReLU activation but with L, L/8 and L neurons respectively. This squeezes the features in the manner of an SE (Squeeze-and-Excitation) network; the compression increases the nonlinearity of the network and yields more accurate weights. After this nonlinear mapping, the weights of the spatial and temporal features are obtained through one fully connected layer with L neurons and Sigmoid activation.
(C-3) Multiply the obtained weights by the spatial and temporal features from before the attention mechanism. Through automatic learning, the spatial and temporal features receive different weights, and features with a large influence on the classification result receive larger weights, improving detection accuracy.
D. Fit and classify the fused spatial and temporal features. Two fully connected layers (1024 and 256 neurons, ReLU activation) fully fit the features. A Dropout layer (dropout rate 0.2) between the two fully connected layers prevents overfitting. Finally, a fully connected layer (2 neurons, Softmax activation) serves as the output layer.
E. The probabilities from the output layer indicate whether the speech under test has been tampered with; the proportion of test utterances whose tampering status is correctly recognized is the recognition rate of the system.
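The SE-style attention fusion of step C can be sketched in NumPy with random stand-ins for the learned weights; every name here is illustrative, and a real model would learn the weight matrices by backpropagation:

```python
import numpy as np

def se_style_fusion(spatial_feat, temporal_feat, rng=None):
    """Sketch of the SE-style attention: L -> L -> L/8 -> L -> sigmoid weights."""
    if rng is None:
        rng = np.random.default_rng(0)
    relu = lambda z: np.maximum(z, 0.0)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    x = np.concatenate([spatial_feat, temporal_feat])   # concatenated features, length L
    L = x.size
    W1 = rng.standard_normal((L, L)) * 0.01             # L -> L
    W2 = rng.standard_normal((L // 8, L)) * 0.01        # L -> L/8 (squeeze)
    W3 = rng.standard_normal((L, L // 8)) * 0.01        # L/8 -> L (excite)
    h = relu(W3 @ relu(W2 @ relu(W1 @ x)))
    Ww = rng.standard_normal((L, L)) * 0.01
    weights = sigmoid(Ww @ h)                           # per-feature weights in (0, 1)
    return x * weights                                  # reweighted fused features
```

The sigmoid keeps every weight in (0, 1), so the fusion rescales rather than replaces the concatenated spatial and temporal features.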
The embodiment also relates to a digital audio tampering passive detection device based on the fusion of power grid frequency space and time sequence characteristics, comprising:
a first module, configured to process the audio data to be detected to obtain an electric network frequency (ENF) component, and to process the ENF component based on the DFT¹ transform to obtain the ENF phases φ⁰ and φ¹;
a second module, configured to compute the frame shifts for φ⁰ and φ¹ respectively and frame each phase sequence accordingly, to perform Reshape on the framed φ⁰ data to obtain the ENF space feature matrix, and to split the framed φ¹ data into two parts to form the ENF temporal representation;
a third module, configured to acquire spatial information from the space feature matrix with a neural network, acquire ENF temporal information from the ENF phase temporal representation, and fuse, fit and classify the spatial and temporal information.
The specific implementation steps of the three modules adopt the steps from the first step to the fifth step, and are not repeated here.
The specific embodiments described herein are offered by way of example only to illustrate the spirit of the invention. Those skilled in the art may make various modifications or additions to the described embodiments or substitutions thereof without departing from the spirit of the invention or exceeding the scope of the invention as defined in the accompanying claims.

Claims (3)

1. A digital audio tampering passive detection method based on fusion of power grid frequency space and time sequence features is characterized by comprising the following steps of
Processing the audio data to be detected to obtain an electric network frequency (ENF) component, and processing the ENF component based on the DFT¹ transform to obtain the ENF phases φ⁰ and φ¹;
Determining the frame count and frame length of the spatial and temporal representations from the longest audio to be detected, computing the frame shifts for φ⁰ and φ¹ respectively, and framing each phase sequence accordingly; performing Reshape on the framed φ⁰ data to obtain the ENF space feature matrix, and splitting the framed φ¹ data into two parts to form the ENF temporal representation;
Acquiring spatial information from the space feature matrix with a neural network, acquiring ENF temporal information from the ENF phase temporal representation, and fusing, fitting and classifying the spatial and temporal information;
Acquiring the ENF phase includes:
Step 2.1, calculating an approximate first derivative of the ENF signal X ENFC [ n ] at the point n
X′ENFC[n]=fd(XENFC[n]-XENFC[n-1]) (1)
Where f d represents an approximate derivative operation, X ENFC [ n ] represents the value of the nth point of the ENF component;
Step 2.2, framing and windowing X_ENFC[n] and X′_ENFC[n]: the frame length is 10 standard ENF periods (10·f_d/f_ENF samples) and the frame shift is 1 standard ENF period (f_d/f_ENF samples); X_ENFC[n] and X′_ENFC[n] are windowed with a Hanning window w(n):
X_N[n] = X_ENFC[n]·w(n)   (2)
X′_N[n] = X′_ENFC[n]·w(n)   (3)
where the Hanning window is w(n) = 0.5·(1 − cos(2πn/(L−1))), 0 ≤ n ≤ L−1, and L is the window length;
Step 2.3, performing an N-point discrete Fourier transform (DFT) on each frame signal X_N[n] and X′_N[n] to obtain X(k) and X′(k);
Step 2.4, letting k_peak be the index of the peak of |X(k)|; k_peak is used to solve the estimated frequency f_DFT = k_peak·f_d/N of the ENF signal;
Step 2.5, from the estimated frequency f_DFT of the ENF signal, the ENF phase characteristic φ_DFT0 = arg[X(k_peak)] is found;
Step 2.6, re-estimating the ENF phase with the DFT1 transform: letting k_peak be the index of the peak of |X′(k)| and multiplying |X′(k)| by a scale factor F(k) that compensates the gain of the discrete differentiation yields DFT0[k] = X(k) and DFT1[k] = F(k)·|X′(k)|; the estimated frequency value is therefore f_DFT1 = DFT1[k_peak]/(2π·|DFT0[k_peak]|);
Step 2.7, k_peak should be the integer closest to N·f_DFT1/f_d (f_d is the resampling frequency), so that f_DFT1 is a reasonable frequency value; the DFT1 phase can accordingly be represented as φ_DFT1 = θ − π/2;
where θ is the value of arg[X′(k)] at the generally non-integer point k0 = N·f_DFT1/f_d, obtained by linear interpolation from X′(k) between k_low = Floor[k0] and k_high = Ceil[k0]; Floor[a] represents the largest integer smaller than a, and Ceil[b] the smallest integer larger than b;
since arg[X′(k)] varies approximately linearly near the peak, linear interpolation between (k_low, θ_low) = (k_low, arg[X′(k_low)]) and (k_high, θ_high) = (k_high, arg[X′(k_high)]) approximates θ at the point k0, and the value obtained is consistent with the value of θ in the above formula;
Step 2.8, φ_DFT1 = θ ± π/2 has two possible values; taking the DFT0 phase φ_DFT0 as reference, the candidate closest to φ_DFT0 is selected as the final value of φ_DFT1;
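The phase-estimation procedure of steps 2.3 to 2.8 can be sketched as follows. This is an illustration only: the 1000 Hz resampling rate, the 8192-point DFT, the test tone, and the function name enf_phase are assumptions, and the scale factor F(k) is folded into a direct magnitude ratio at the peak bin rather than written out explicitly.

```python
import numpy as np

def enf_phase(frame, frame_d, fd, n_dft=8192):
    """Sketch of steps 2.3-2.8: DFT0 phase and refined DFT1 phase of one frame."""
    X = np.fft.rfft(frame, n_dft)
    Xd = np.fft.rfft(frame_d, n_dft)
    k_peak = int(np.argmax(np.abs(X)))
    phi_dft0 = float(np.angle(X[k_peak]))              # DFT0 phase (step 2.5)
    # |X'(f)| ~= 2*pi*f*|X(f)| for a differentiator, so the magnitude ratio
    # at the peak bin yields the refined frequency estimate (step 2.6)
    f_dft1 = float(np.abs(Xd[k_peak]) / (2 * np.pi * np.abs(X[k_peak])))
    k0 = n_dft * f_dft1 / fd                           # generally non-integer (2.7)
    k_lo, k_hi = int(np.floor(k0)), int(np.ceil(k0))
    th_lo, th_hi = np.angle(Xd[k_lo]), np.angle(Xd[k_hi])
    theta = th_lo if k_hi == k_lo else th_lo + (th_hi - th_lo) * (k0 - k_lo)
    # differentiation advances the phase by pi/2, leaving two candidates (2.8);
    # pick the one circularly closest to the DFT0 reference phase
    cands = np.array([theta - np.pi / 2, theta + np.pi / 2])
    dist = np.abs(np.angle(np.exp(1j * (cands - phi_dft0))))
    return phi_dft0, float(cands[np.argmin(dist)])

fd, f_enf, phase = 1000.0, 50.2, 0.3                   # assumed test parameters
n = np.arange(200)
w = np.hanning(200)
x = np.cos(2 * np.pi * f_enf * n / fd + phase)
x_d = fd * np.diff(np.cos(2 * np.pi * f_enf * np.arange(-1, 200) / fd + phase))
phi0, phi1 = enf_phase(x * w, x_d * w, fd)
```

On a clean off-nominal tone, both estimates land near the true initial phase; the DFT1 branch additionally interpolates between bins, which is what gives the refined estimate its sub-bin precision.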
The specific method for calculating the ENF space feature matrix comprises the following steps:
step 3.1, acquiring audio data with the longest duration in the audio data to be detected;
step 3.2, for the longest-duration audio, obtaining the phase φ_max by the DFT transform;
step 3.3, calculating the length of the longest phase sequence, length(φ_max);
step 3.4, calculating the frame length m = Ceil[√length(φ_max)], where m is the frame length of the feature matrix;
step 3.5, calculating the phase φ of all the audio data;
step 3.6, calculating the frame shift and framing; the frame shift satisfies overlap = m − Floor[length(φ)/n];
step 3.7, carrying out Reshape on the framed phase data to obtain a feature matrix P_n×n;
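The frame-and-reshape construction of the spatial matrix can be sketched as follows. Since the exact sizing formulas were lost in extraction, the square sizing n = ⌈√length(φ)⌉, the hop ⌊length(φ)/n⌋, and the zero-padding of a short last frame are assumptions of this sketch.

```python
import math
import numpy as np

def enf_spatial_matrix(phi):
    """Frame the phase sequence and reshape it into a square matrix P (step 3.7)."""
    n = math.ceil(math.sqrt(len(phi)))     # assumed sizing: n x n matrix
    hop = len(phi) // n                    # assumed frame shift
    rows = [phi[i * hop: i * hop + n] for i in range(n)]
    rows = [np.pad(r, (0, n - len(r))) for r in rows]   # guard a short last frame
    return np.stack(rows)

phi = np.linspace(0.0, 1.0, 1000)          # stand-in ENF phase sequence
P = enf_spatial_matrix(phi)                # square feature matrix P
```

The overlap between consecutive rows means nearby rows share phase samples, which is what gives the matrix its local 2-D structure for the convolutional branch.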
The specific method for calculating the ENF time sequence characterization comprises the following steps:
Step 4.1, acquiring the longest duration audio data in the audio data to be detected;
Step 4.2, for the longest-duration audio, obtaining the phase Φ by the DFT transform;
Step 4.3, setting the frame length m and calculating the number of frames n according to the length of the longest phase sequence;
Step 4.4, for all audio data, calculating the frame shift overlap = m − Floor[length(Φ)/n];
step 4.5, since the phase sequence generally cannot be divided into frames exactly, the frames are split into two parts: k frames use a frame shift one sample larger than that of the remaining n − k frames, where k = length(Φ) − (m − overlap)×n;
Step 4.6, the n framed phase vectors together constitute the ENF phase time sequence characterization;
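The two-part framing of steps 4.4 and 4.5 can be sketched as follows. The claim fixes overlap and k; which frames absorb the leftover samples (here the first k frames take the larger shift) is an assumption of this sketch.

```python
import numpy as np

def enf_timing_frames(phi, m, n):
    """Split phi into n frames of length m using two frame shifts (step 4.5)."""
    overlap = m - len(phi) // n           # step 4.4
    shift = m - overlap                   # base frame shift
    k = len(phi) - shift * n              # leftover samples, as in the claim
    frames, start = [], 0
    for i in range(n):
        frames.append(phi[start:start + m])
        start += shift + (1 if i < k else 0)   # first k frames shift one extra
    return np.stack(frames)

phi = np.arange(1003, dtype=float)        # stand-in phase sequence
F = enf_timing_frames(phi, m=50, n=20)    # n frames of m phase samples each
```

With this assignment the final frame ends exactly at the last phase sample, so no data is dropped even when length(Φ) does not divide evenly.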
The network model part includes:
Step 5.1, acquiring spatial information through a convolutional neural network: the feature matrix P_n×n is processed by two convolution blocks to obtain the ENF spatial information; each convolution block consists of two identical convolution layers and one pooling layer, the numbers of convolution kernels of the two convolution blocks being 32 and 64 respectively; the convolution kernel size is 3×3 and the stride is 1; the max-pooling pool size is 3; the last pooling layer outputs the ENF spatial information;
Step 5.2, acquiring time sequence information through a Bi-LSTM network: the ENF phase time sequence characterization is processed by two bidirectional long short-term memory (Bi-LSTM) modules, and the state of each time step is output; each Bi-LSTM module comprises a bidirectional LSTM layer, a LayerNormalization layer and a LeakyReLU activation function;
Step 5.3, concatenating the spatial and time sequence features to obtain a feature vector of length L;
Step 5.4, passing the feature vector through three fully connected layers whose activation functions are ReLU and whose neuron numbers are L, L/8 and L respectively; the aim is to compress the features: following the Squeeze-and-Excitation network, the compression increases the nonlinearity of the network so that more accurate weights are obtained; after this nonlinear operation, the weights of the spatial and time sequence features are obtained through a fully connected layer with L neurons and a Sigmoid activation function;
Step 5.5, multiplying the obtained weights with the spatial and time sequence features from before the attention mechanism to weight them; through automatic learning, different weights can thus be assigned to the spatial and time sequence features, and the features with a large influence on the classification result receive larger weights, improving the detection accuracy;
Step 5.6, fitting and classifying the fused spatial and time sequence features: two fully connected layers with 1024 and 256 neurons respectively and ReLU activation fit the features fully; a Dropout layer (dropout rate = 0.2) is added between the two fully connected layers to prevent overfitting; finally, a fully connected layer with 2 neurons and a Softmax activation function serves as the output layer;
Step 5.7, the probability output by the output layer finally indicates whether the speech under test has been tampered with; the proportion of all test utterances whose tampering status is correctly recognized is the recognition rate of the system.
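The Squeeze-and-Excitation-style fusion and weighting described above can be sketched in plain numpy. The function name se_fusion, the dimension L = 64, and the random weight matrices are illustrative stand-ins for learned parameters; the first L→L dense layer of the claimed three-layer compression is folded away for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def se_fusion(spatial, temporal, w_sq, w_ex, w_gate):
    """Squeeze-and-Excitation style re-weighting of fused features.

    The concatenated feature (length L) is compressed L -> L/8 -> L through
    ReLU layers, a sigmoid layer produces per-element weights in (0, 1),
    and the weights re-scale the original fused feature.
    """
    f = np.concatenate([spatial, temporal])      # fused feature, length L
    h = np.maximum(0.0, w_sq @ f)                # squeeze: L -> L/8, ReLU
    h = np.maximum(0.0, w_ex @ h)                # excite: L/8 -> L, ReLU
    gate = 1.0 / (1.0 + np.exp(-(w_gate @ h)))   # sigmoid attention weights
    return gate * f                              # element-wise re-weighting

L = 64
spatial = rng.standard_normal(L // 2)            # stand-in CNN output
temporal = rng.standard_normal(L // 2)           # stand-in Bi-LSTM output
w_sq = 0.1 * rng.standard_normal((L // 8, L))    # random stand-ins for the
w_ex = 0.1 * rng.standard_normal((L, L // 8))    # learned dense-layer weights
w_gate = 0.1 * rng.standard_normal((L, L))
out = se_fusion(spatial, temporal, w_sq, w_ex, w_gate)
```

Because the gate is a sigmoid, each fused feature is attenuated rather than amplified, which is the sense in which the mechanism assigns larger relative weights to the more informative spatial or temporal components.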
2. The digital audio tampering passive detection method based on fusion of power grid frequency space and time sequence features according to claim 1, characterized in that the original voice signal is processed to obtain the electric network frequency (ENF) component, specifically comprising:
downsampling: the signal resampling frequency is set to 1000 Hz or 1200 Hz;
narrow-band filtering with a 10000th-order linear zero-phase FIR filter: the center frequency is at the ENF standard value, the bandwidth is 0.6 Hz, the passband ripple is 0.5 dB, and the stopband attenuation is 100 dB.
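The narrow-band extraction in claim 2 can be sketched with a window-method FIR design, using numpy only. The Hamming window here is a stand-in for the claimed design and does not exactly meet the 0.5 dB / 100 dB specification, which would normally call for e.g. a Parks-McClellan design.

```python
import numpy as np

def bandpass_fir(fd, f_center, bw, numtaps=10001):
    """Window-method linear-phase band-pass FIR around the ENF standard value."""
    n = np.arange(numtaps) - (numtaps - 1) / 2   # symmetric tap indices
    lo = (f_center - bw / 2) / fd                # normalized band edges
    hi = (f_center + bw / 2) / fd
    # difference of two sinc low-pass prototypes gives the band-pass
    h = 2 * hi * np.sinc(2 * hi * n) - 2 * lo * np.sinc(2 * lo * n)
    return h * np.hamming(numtaps)

fd = 1000.0                                      # resampling frequency (claim 2)
h = bandpass_fir(fd, f_center=50.0, bw=0.6)      # 0.6 Hz band around 50 Hz
# applying the filter is a convolution: y = np.convolve(x, h, mode="same")
```

At this tap count the transition band at 1000 Hz is a fraction of a hertz, which is why such a high filter order is needed to isolate a 0.6 Hz band around the nominal grid frequency.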
3. A digital audio tampering passive detection device based on fusion of power grid frequency space and time sequence features, adopting the method of claim 1 or 2, comprising:
A first module: configured to process the audio data to be detected to obtain an electric network frequency (ENF) component, and to process the ENF component based on the DFT1 transform to obtain the ENF phase estimates φ_DFT0 and φ_DFT1;
A second module: configured to calculate the frame shifts corresponding to the ENF phases φ_DFT0 and φ_DFT1 respectively and to frame each phase with its frame shift, to carry out Reshape on the framed data of the ENF phase φ_DFT0 to obtain the ENF spatial feature matrix, and to split the framed data of the ENF phase φ_DFT1 into two parts to form the ENF time sequence characterization;
and a third module: configured to acquire spatial information from the spatial feature matrix by using a neural network, to acquire ENF time sequence information from the ENF phase time sequence characterization, and to fuse, fit and classify the spatial and time sequence information.
CN202210450835.1A 2022-04-26 2022-04-26 Digital audio tampering passive detection method and device based on fusion of power grid frequency space and time sequence characteristics Active CN114722964B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210450835.1A CN114722964B (en) 2022-04-26 2022-04-26 Digital audio tampering passive detection method and device based on fusion of power grid frequency space and time sequence characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210450835.1A CN114722964B (en) 2022-04-26 2022-04-26 Digital audio tampering passive detection method and device based on fusion of power grid frequency space and time sequence characteristics

Publications (2)

Publication Number Publication Date
CN114722964A CN114722964A (en) 2022-07-08
CN114722964B true CN114722964B (en) 2024-08-02

Family

ID=82245886

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210450835.1A Active CN114722964B (en) 2022-04-26 2022-04-26 Digital audio tampering passive detection method and device based on fusion of power grid frequency space and time sequence characteristics

Country Status (1)

Country Link
CN (1) CN114722964B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118368483B (en) * 2024-06-19 2024-09-06 华侨大学 Method, device, equipment and medium for detecting video inter-frame tampering in power grid environment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109284717A (en) * 2018-09-25 2019-01-29 华中师范大学 It is a kind of to paste the detection method and system for distorting operation towards digital audio duplication
CN112151067A (en) * 2020-09-27 2020-12-29 湖北工业大学 Passive detection method for digital audio tampering based on convolutional neural network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11069370B2 (en) * 2016-01-11 2021-07-20 University Of Tennessee Research Foundation Tampering detection and location identification of digital audio recordings
CN112489677B (en) * 2020-11-20 2023-09-22 平安科技(深圳)有限公司 Voice endpoint detection method, device, equipment and medium based on neural network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109284717A (en) * 2018-09-25 2019-01-29 华中师范大学 It is a kind of to paste the detection method and system for distorting operation towards digital audio duplication
CN112151067A (en) * 2020-09-27 2020-12-29 湖北工业大学 Passive detection method for digital audio tampering based on convolutional neural network

Also Published As

Publication number Publication date
CN114722964A (en) 2022-07-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant