
CN114612936A - Unsupervised abnormal behavior detection method based on background suppression - Google Patents

Unsupervised abnormal behavior detection method based on background suppression

Info

Publication number
CN114612936A (application number CN202210252961.6A)
Authority
CN
China
Prior art keywords
dimensional
layer
convolution
activation function
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210252961.6A
Other languages
Chinese (zh)
Other versions
CN114612936B (en)
Inventor
路文
李玎
朱志强
朱振杰
何立火
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202210252961.6A priority Critical patent/CN114612936B/en
Publication of CN114612936A publication Critical patent/CN114612936A/en
Application granted granted Critical
Publication of CN114612936B publication Critical patent/CN114612936B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/088 Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an unsupervised abnormal behavior detection method based on background suppression, which comprises the following steps: (1) acquiring a training sample set and a test sample set; (2) constructing an unsupervised abnormal behavior detection network model H; (3) iteratively training the unsupervised abnormal behavior detection network model H; (4) defining the anomaly score function score of the trained unsupervised abnormal behavior detection network model H*; (5) acquiring the abnormal behavior detection result. The unsupervised abnormal behavior detection network model constructed by the invention overcomes the defect of the prior art, which considers neither the influence of video-frame background features on algorithm perception nor the influence of training-set labeling accuracy on supervised learning, and improves the recognition accuracy of abnormal behavior detection.

Description

Unsupervised abnormal behavior detection method based on background suppression
Technical Field
The invention belongs to the technical field of computer vision, and relates to an abnormal behavior detection method, in particular to an unsupervised road monitoring video abnormal behavior detection method based on background suppression.
Background
Road monitoring is the most convenient and direct way to observe the behavior of passersby, and as the number of traffic accidents caused by passersby failing to use sidewalks according to traffic regulations increases, there is an urgent need to detect abnormal passerby behavior.
In recent years, with the rapid development of deep learning and open-source data sets, intelligent monitoring equipment has developed correspondingly; abnormal behavior detection is currently the most widely applied function of intelligent monitoring equipment in daily life and provides reliable safety guarantees for people's daily work and life. However, when detecting passersby, current intelligent monitoring equipment with built-in detection algorithms is easily influenced by factors such as ambient light, background targets and background features similar to the foreground; in addition, if a supervised abnormal behavior detection algorithm is adopted, the accuracy of the manually labeled data set also influences the algorithm. These factors introduce unavoidable interference, reduce the accuracy of abnormal behavior detection and weaken the robustness of the algorithm. Therefore, detection accuracy and robustness are important indexes for evaluating the performance of an abnormal behavior detection algorithm.
The patent application "Abnormal behavior detection method based on deep learning" (application number CN202110611720.1; publication number CN113361370A) filed by Nanjing Tech University discloses an abnormal behavior detection method based on deep learning. The method first obtains RGB images of the actual scene with a camera, then detects pedestrians in the current video frame with the YOLOv5 algorithm, outputting the position, confidence and category of each detection box; a constructed appearance feature network performs cascade matching of targets in adjacent frames to obtain matched tracks; finally, Kalman prediction deletes, creates and tracks track results to obtain the final tracks, which are matched with the next frame, and the cycle repeats. The method has two disadvantages: first, it does not consider the influence of video-frame background features on algorithm perception, so background interference degrades the accuracy of the abnormal behavior detection algorithm; second, YOLOv5 is a supervised algorithm, so the accuracy of pedestrian labels in the manually labeled data set also affects the detection accuracy during training.
The patent application "A violent abnormal behavior detection method based on deep learning" (application number CN202110224967.8; publication number CN113191182A) filed by Harbin University of Science and Technology proposes a violent abnormal behavior detection method. The method first divides the videos of a data set into frames, then stacks several consecutive frames into a cube, extracts three-dimensional features from the cube with a three-dimensional convolutional neural network, fuses the features, and uses the YOLO algorithm to judge whether the extracted features contain forbidden articles such as knives, guns and sticks. The method has two disadvantages: first, it does not fully consider the interference of background features similar to the foreground in real-life scenes; second, YOLO is a supervised algorithm, so the accuracy of pedestrian labels in the manually labeled data set also affects the detection accuracy during training.
Disclosure of Invention
The invention aims to provide an unsupervised abnormal behavior detection method based on background suppression that addresses the defects of the prior art, solving the technical problem of low detection accuracy caused by neglecting the background information of the video to be detected and by relying on manually labeled data sets.
In order to achieve the purpose, the technical scheme adopted by the invention comprises the following steps:
(1) acquiring a training sample set and a testing sample set:
(1a) Randomly selecting M sidewalk monitoring videos and decomposing them to obtain a set of M frame sequences S_v = {S_v^1, ..., S_v^M}, where S_v^m = {v_1, ..., v_{K_m}} denotes the m-th frame sequence containing K_m frame images, v_k denotes the k-th frame image of S_v^m, M ≥ 200, and K_m ≥ 100;
(1b) Screening out, from each frame sequence S_v^m of the set S_v, the N_m frame images that contain only pedestrian walking events to form a normal behavior frame sequence, the normal behavior frame sequences of all M frame sequences forming the training sample set B_train; the P_m frame images remaining in S_v^m forming an abnormal behavior frame sequence, and all abnormal behavior frame sequences forming the test sample set B_test, where N_m ≥ P_m and P_m = K_m − N_m;
(2) Constructing an unsupervised abnormal behavior detection network model H:
(2a) Constructing an unsupervised abnormal behavior detection network model H comprising a background suppression module, a prediction module and a background suppression constraint module connected in sequence, the output end of the background suppression module also being connected to a context memory module; wherein:
the prediction module comprises a spatial encoder, a convolutional long short-term memory (ConvLSTM) module and a decoder connected in sequence; the spatial encoder adopts a feature extraction network comprising a plurality of two-dimensional convolution layers and a plurality of activation function layers; the ConvLSTM module adopts a memory convolutional neural network comprising a plurality of two-dimensional convolution layers, a plurality of tensor decomposition layers and a plurality of activation function layers; the decoder adopts a transposed convolutional neural network comprising a plurality of two-dimensional transposed convolution layers and a plurality of activation function layers;
the context memory module comprises a motion matching encoder and a memory module connected in sequence; the motion matching encoder adopts a three-dimensional convolutional neural network comprising a plurality of three-dimensional convolution layers, a plurality of activation function layers, a plurality of three-dimensional max pooling layers and one three-dimensional average pooling layer;
the output end of the memory module in the context memory module is connected to the input end of the decoder in the prediction module;
(2b) Defining the background suppression loss function L_BGS of the background suppression constraint module, the background constraint loss function L_restrain, the least square error L_2 and the least absolute deviation L_1:

L_BGS = ||Binary(x̂_n) − Binary(x_n)||_1

L_2 = ||x̂_n − x_n||_2^2

L_1 = ||x̂_n − x_n||_1

L_restrain = L_BGS + L_2 + L_1

where ||·||_1 denotes the 1-norm, ||·||_2 the 2-norm, Binary(·) denotes binarization, x̂_n denotes the prediction result of x_n, and x_n denotes the n-th frame image of the normal behavior frame sequence;
(3) carrying out iterative training on the unsupervised abnormal behavior detection network model H:
(3a) Initializing the iteration number t and the maximum iteration number T, T ≥ 80; denoting the parameters of the feature extraction network at the t-th iteration by θ_G1_t, the memory convolutional neural network parameters by θ_G2_t, the transposed convolutional neural network parameters by θ_G3_t and the three-dimensional convolutional neural network parameters by θ_G4_t; and letting t = 1;
(3b) Taking the training sample set B_train as the input of the unsupervised abnormal behavior detection network model H and obtaining, at the t-th iteration, the prediction results x̂ of the frame sequences:
(3b1) The background suppression module suppresses the background information of each normal behavior frame image x_n in each normal behavior frame sequence of the training sample set B_train, obtaining M background-suppressed frame sequences;
(3b2) The spatial encoder in the prediction module extracts features from each frame image of the background-suppressed c-th frame sequence, and the ConvLSTM module decomposes the feature tensor composed of all extracted features to obtain the feature information of that sequence and stores it, c ∈ [2, M−1];
(3b3) The context memory module extracts features from each frame image of the M−1 normal behavior frame sequences other than the c-th one; the features of all frame images preceding the c-th sequence form the preceding-context information and are stored, while the features of all frame images following it form the following-context information and are stored;
(3b4) The decoder in the prediction module decodes the feature information obtained in step (3b2) together with the preceding-context and following-context information obtained in step (3b3), obtaining the prediction result of the c-th frame sequence at the t-th iteration;
(3c) The background suppression constraint module binarizes the prediction result x̂_n and the normal behavior frame image x_n of the c-th normal behavior frame sequence, obtaining the binary image Binary(x̂_n) of the prediction result at iteration t and the binary image Binary(x_n) of the n-th normal behavior frame image;
(3d) Using the background suppression loss function L_BGS, computing the background suppression loss value L_BGS of H_t from Binary(x̂_n) and Binary(x_n); and using the background constraint loss function L_restrain, computing the background constraint loss value L_restrain of H_t from L_BGS, L_2 and L_1;
(3e) Using back-propagation, computing the gradients of H_t's network parameters from L_restrain; then updating the network parameters θ_G1_t, θ_G2_t, θ_G3_t and θ_G4_t by stochastic gradient descent using these gradients, obtaining this iteration's unsupervised abnormal behavior detection network model H_t;
(3f) Judging whether t ≥ T; if so, the trained unsupervised abnormal behavior detection network model H* is obtained; otherwise, letting t = t + 1 and H = H_t, and returning to step (3b);
(4) acquiring an abnormal behavior detection result:
(4a) Taking the c-th abnormal behavior frame sequence of the test sample set B_test as the input of the trained unsupervised abnormal behavior detection network model H* and forward-propagating it to obtain its predicted frame image ŷ;
(4b) Using the anomaly score function score, computing F = score(ŷ, y) from the predicted frame image ŷ and the real frame image y, and judging whether F and the preset anomaly score detection threshold I satisfy F ≥ I; if so, the frame sequence contains abnormal behavior, otherwise it does not, where score is computed from the difference between the predicted frame image and the real frame image.
compared with the prior art, the invention has the following advantages:
First, because the constructed abnormal behavior detection network model comprises a background suppression module and a background suppression constraint module, the influence of background target feature information on foreground anomaly detection is considered during training and detection: the model first weakens static background information with the background suppression module, then suppresses dynamic background information with the background suppression constraint module, and finally strengthens the information of the foreground target. This avoids the false detections of the prior art, which considers only foreground information and neglects background information, and effectively improves detection accuracy.
Second, because the prediction module of the constructed network model connects a spatial encoder, a ConvLSTM module and a decoder in sequence, the invention realizes unsupervised abnormal behavior detection by means of this encoder-decoder structure and overcomes the influence of manual-labeling accuracy on supervised learning, giving the method strong robustness across different data sets.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention.
Fig. 2 is a schematic structural diagram of an abnormal behavior detection network model constructed by the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples.
Referring to fig. 1, the present invention includes the steps of:
step 1) obtaining a training sample set and a testing sample set:
(1a) Randomly selecting M sidewalk monitoring videos and decomposing them to obtain a set of M frame sequences S_v = {S_v^1, ..., S_v^M}, where S_v^m = {v_1, ..., v_{K_m}} denotes the m-th frame sequence containing K_m frame images, v_k denotes the k-th frame image of S_v^m, M ≥ 200, and K_m ≥ 100;
In this example, experiments show that when M = 200, training is fast and the model's detection effect is good.
(1b) Screening out, from each frame sequence S_v^m of the set S_v, the N_m frame images that contain only pedestrian walking events to form a normal behavior frame sequence, the normal behavior frame sequences of all M frame sequences forming the training sample set B_train; the P_m frame images remaining in S_v^m forming an abnormal behavior frame sequence, and all abnormal behavior frame sequences forming the test sample set B_test, where N_m ≥ P_m and P_m = K_m − N_m;
In this example, a pedestrian walking in the sidewalk monitoring video is defined as normal behavior, while riding a bicycle or a skateboard is defined as abnormal behavior.
Step 2), constructing an unsupervised abnormal behavior detection network model H:
(2a) Constructing an unsupervised abnormal behavior detection network model H comprising a background suppression module, a prediction module and a background suppression constraint module connected in sequence, the output end of the background suppression module also being connected to a context memory module. The prediction module comprises a spatial encoder, a convolutional long short-term memory (ConvLSTM) module and a decoder connected in sequence; the spatial encoder adopts a feature extraction network comprising a plurality of two-dimensional convolution layers and a plurality of activation function layers; the ConvLSTM module adopts a memory convolutional neural network comprising a plurality of two-dimensional convolution layers, a plurality of tensor decomposition layers and a plurality of activation function layers; the decoder adopts a transposed convolutional neural network comprising a plurality of two-dimensional transposed convolution layers and a plurality of activation function layers. The context memory module comprises a motion matching encoder and a memory module connected in sequence, the output end of the memory module being connected to the input end of the decoder in the prediction module; the motion matching encoder adopts a three-dimensional convolutional neural network comprising a plurality of three-dimensional convolution layers, a plurality of activation function layers, a plurality of three-dimensional max pooling layers and one three-dimensional average pooling layer;
The spatial encoder contains 4 two-dimensional convolution layers and 4 activation function layers, with the structure: first two-dimensional convolution layer → first activation function layer → second two-dimensional convolution layer → second activation function layer → third two-dimensional convolution layer → third activation function layer → fourth two-dimensional convolution layer → fourth activation function layer. The first two-dimensional convolution layer has 1 input channel, 64 output channels and stride 2; the second has 64 input channels, 64 output channels and stride 1; the third has 64 input channels, 128 output channels and stride 2; the fourth has 128 input channels, 128 output channels and stride 1. All 4 two-dimensional convolution layers use 3×3 convolution kernels, and all 4 activation function layers use the ELU function.
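As a concrete illustration, the spatial encoder above can be sketched in PyTorch as follows (a minimal sketch: the channel widths, strides, 3×3 kernels and ELU activations follow the description, while padding=1 is an assumption made so that the stride-1 layers preserve spatial size):

```python
import torch
import torch.nn as nn

spatial_encoder = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3, stride=2, padding=1),     # first two-dimensional convolution layer
    nn.ELU(),                                                 # first activation function layer
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),    # second
    nn.ELU(),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),   # third
    nn.ELU(),
    nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1),  # fourth
    nn.ELU(),
)

# Example: a grayscale 256x256 frame maps to a 128-channel 64x64 feature map.
feat = spatial_encoder(torch.randn(1, 1, 256, 256))
print(feat.shape)  # torch.Size([1, 128, 64, 64])
```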
Because each frame sequence in this example is obtained by decomposing a video, the feature information of the frame images within a sequence is strongly correlated. Compared with the prior art, which uses only an ordinary convolutional neural network to extract frame image features, this example uses the spatial encoder to extract features from each frame image, so that the extracted feature information retains this strong correlation and yields a better decoding effect in the decoder.
The ConvLSTM module contains 2 two-dimensional convolution layers, 2 tensor decomposition layers and 3 activation function layers, with the structure: first two-dimensional convolution layer → second two-dimensional convolution layer → first tensor decomposition layer → second tensor decomposition layer → first activation function layer → second activation function layer → third activation function layer. The two two-dimensional convolution layers are identical, with 128 input channels and 128 output channels; the 3 activation function layers all use the sigmoid function.
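The three sigmoid layers match the gating of a convolutional LSTM; a standard ConvLSTM cell with 128 channels is sketched below as one plausible reading of this module (the tensor decomposition layers are unspecified in the text and are omitted here):

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, channels=128, kernel_size=3):
        super().__init__()
        # One convolution produces all four gate pre-activations at once.
        self.gates = nn.Conv2d(2 * channels, 4 * channels,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, h, c):
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)  # three sigmoid gates
        c = f * c + i * torch.tanh(g)   # update the cell (memory) state
        h = o * torch.tanh(c)           # update the hidden state
        return h, c

# Usage: step the cell over the feature maps of a frame sequence.
conv_lstm = ConvLSTMCell()
```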
The decoder contains 4 two-dimensional transposed convolution layers and 3 activation function layers, with the structure: first two-dimensional transposed convolution layer → first activation function layer → second two-dimensional transposed convolution layer → second activation function layer → third two-dimensional transposed convolution layer → third activation function layer → fourth two-dimensional transposed convolution layer. The first two-dimensional transposed convolution layer has 256 input channels, 128 output channels and stride 1; the second has 128 input channels, 64 output channels and stride 2; the third has 64 input channels, 64 output channels and stride 1; the fourth has 64 input channels, 1 output channel and stride 1. All 4 transposed convolution layers use 3×3 convolution kernels, and the 3 activation function layers all use the ELU function.
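A corresponding PyTorch sketch of the decoder (the padding and output_padding values are assumptions chosen so that the single stride-2 layer exactly doubles spatial resolution):

```python
import torch
import torch.nn as nn

decoder = nn.Sequential(
    nn.ConvTranspose2d(256, 128, kernel_size=3, stride=1, padding=1),  # first
    nn.ELU(),
    nn.ConvTranspose2d(128, 64, kernel_size=3, stride=2,
                       padding=1, output_padding=1),                   # second (upsamples 2x)
    nn.ELU(),
    nn.ConvTranspose2d(64, 64, kernel_size=3, stride=1, padding=1),    # third
    nn.ELU(),
    nn.ConvTranspose2d(64, 1, kernel_size=3, stride=1, padding=1),     # fourth: back to a 1-channel frame
)

# Example: a 256-channel 64x64 tensor decodes to a 1-channel 128x128 frame.
frame = decoder(torch.randn(1, 256, 64, 64))
print(frame.shape)  # torch.Size([1, 1, 128, 128])
```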
The motion matching encoder contains 6 three-dimensional convolution layers, 6 activation function layers, 4 three-dimensional max pooling layers and 1 three-dimensional average pooling layer, with the structure: first three-dimensional convolution layer → first activation function layer → first three-dimensional max pooling layer → second three-dimensional convolution layer → second activation function layer → second three-dimensional max pooling layer → third three-dimensional convolution layer → third activation function layer → fourth three-dimensional convolution layer → fourth activation function layer → third three-dimensional max pooling layer → fifth three-dimensional convolution layer → fifth activation function layer → sixth three-dimensional convolution layer → sixth activation function layer → fourth three-dimensional max pooling layer → three-dimensional average pooling layer. The first three-dimensional convolution layer has 1 input channel and 64 output channels; the second 64 and 128; the third 128 and 256; the fourth 256 and 256; the fifth 256 and 512; the sixth 512 and 512; all strides are 1, and all 6 layers use 3×3 convolution kernels. The first three-dimensional max pooling layer has pooling kernel size 1×2 and stride 1×2; the second, third and fourth have kernel size 2×2 and stride 2×2; the three-dimensional average pooling layer has kernel size 1×2. All 6 activation function layers use the ReLU function.
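A PyTorch sketch of the motion matching encoder follows; since the kernel and pooling sizes in the text appear truncated, full three-dimensional sizes are assumed here (3×3×3 convolution kernels, a (1,2,2) first max pooling and (2,2,2) for the remaining max pooling layers):

```python
import torch
import torch.nn as nn

def conv3d(cin, cout):
    # 3-D convolution (assumed 3x3x3 kernel) followed by ReLU.
    return nn.Sequential(nn.Conv3d(cin, cout, kernel_size=3, stride=1, padding=1),
                         nn.ReLU())

motion_matching_encoder = nn.Sequential(
    conv3d(1, 64),
    nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2)),  # first 3-D max pooling layer
    conv3d(64, 128),
    nn.MaxPool3d(kernel_size=2, stride=2),                  # second
    conv3d(128, 256),
    conv3d(256, 256),
    nn.MaxPool3d(kernel_size=2, stride=2),                  # third
    conv3d(256, 512),
    conv3d(512, 512),
    nn.MaxPool3d(kernel_size=2, stride=2),                  # fourth
    nn.AvgPool3d(kernel_size=(1, 2, 2)),                    # three-dimensional average pooling layer
)

# Example: 16 stacked 128x128 frames -> a 512-channel spatio-temporal code.
code = motion_matching_encoder(torch.randn(1, 1, 16, 128, 128))
print(code.shape)  # torch.Size([1, 512, 2, 4, 4])
```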
(2b) Defining the background suppression loss function L_BGS of the background suppression constraint module, the background constraint loss function L_restrain, the least square error L_2 and the least absolute deviation L_1:

L_BGS = ||Binary(x̂_n) − Binary(x_n)||_1

L_2 = ||x̂_n − x_n||_2^2

L_1 = ||x̂_n − x_n||_1

L_restrain = L_BGS + L_2 + L_1

where ||·||_1 denotes the 1-norm, ||·||_2 the 2-norm, Binary(·) denotes binarization, x̂_n denotes the prediction result of x_n, and x_n denotes the n-th frame image of the normal behavior frame sequence;
In this example, if the background constraint loss function L_restrain used only the least square error L_2 and the background suppression loss function L_BGS to compute the loss of the unsupervised abnormal behavior detection network model, the pixel similarity between the prediction result x̂_n and the normal behavior frame image x_n could be guaranteed, but the prediction x̂_n would easily become blurred; therefore, to alleviate the blurring of x̂_n, the least absolute deviation L_1 is also added to the background constraint loss function L_restrain when computing the model loss.
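Under the reconstructed loss forms above, the background constraint loss can be sketched in PyTorch as follows (a minimal sketch; note that the hard binarization carries no gradient, so in this naive form L_BGS contributes only its value, not a gradient):

```python
import torch

def binary(img):
    # Set every non-zero pixel value to 1, as in step (3c).
    return (img != 0).float()

def restrain_loss(pred, target):
    l2 = torch.sum((pred - target) ** 2)                         # least square error L_2
    l1 = torch.sum(torch.abs(pred - target))                     # least absolute deviation L_1
    l_bgs = torch.sum(torch.abs(binary(pred) - binary(target)))  # background suppression loss L_BGS
    return l_bgs + l2 + l1                                       # L_restrain = L_BGS + L_2 + L_1
```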
Step 3) carrying out iterative training on the unsupervised abnormal behavior detection network model H:
(3a) Initializing the iteration number t and the maximum iteration number T, T ≥ 80; denoting the parameters of the feature extraction network at the t-th iteration by θ_G1_t, the memory convolutional neural network parameters by θ_G2_t, the transposed convolutional neural network parameters by θ_G3_t and the three-dimensional convolutional neural network parameters by θ_G4_t; and letting t = 1;
In this example, the trained unsupervised abnormal behavior detection network model achieves the best detection effect when the maximum iteration number T = 100;
(3b) Taking the training sample set B_train as the input of the unsupervised abnormal behavior detection network model H and obtaining, at the t-th iteration, the prediction results x̂ of the frame sequences:
(3b1) The background suppression module suppresses the background information of each normal behavior frame image x_n in each normal behavior frame sequence of the training sample set B_train, and all background-suppressed frame images form a frame image sequence. The implementation steps are as follows: the background suppression module adjusts the illumination of each normal behavior frame image x_n by gamma correction; the gamma-corrected frame image is Gaussian-filtered to remove noise; and Laplacian sharpening is then applied to the Gaussian-filtered frame image to suppress background information, obtaining the background-suppressed frame image.
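A sketch of this preprocessing chain with OpenCV (the gamma value, Gaussian kernel size and sharpening weight are illustrative assumptions; the text does not state them):

```python
import cv2
import numpy as np

def suppress_background(frame_gray, gamma=0.8):
    # Gamma correction to adjust illumination.
    norm = frame_gray.astype(np.float32) / 255.0
    corrected = np.power(norm, gamma)
    # Gaussian filtering to remove noise points.
    smoothed = cv2.GaussianBlur(corrected, (5, 5), 1.0)
    # Laplacian sharpening: subtracting the Laplacian emphasises edges
    # (foreground contours) and flattens smooth background regions.
    lap = cv2.Laplacian(smoothed, cv2.CV_32F, ksize=3)
    sharpened = np.clip(smoothed - lap, 0.0, 1.0)
    return (sharpened * 255).astype(np.uint8)
```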
(3b2) The spatial encoder in the prediction module extracts features from each frame image of the background-suppressed c-th frame sequence, and the ConvLSTM module decomposes the feature tensor composed of all extracted features to obtain the feature information of that sequence and stores it, c ∈ [2, M−1]. The process is as follows: the spatial encoder extracts features from each frame image of the sequence through the convolution layers and activation function layers of the feature extraction network and stacks them into a feature tensor; the ConvLSTM module then decomposes this tensor through its convolution layers, tensor decomposition layers and activation function layers to obtain the feature information.
(3b3) The context memory module extracts features from each frame image of the M−1 normal behavior frame sequences other than the c-th one. The process is as follows: the motion matching encoder extracts features from each frame image of all frame sequences other than the c-th one by means of the three-dimensional convolutional neural network and encodes the extracted features; the features of all frame sequences preceding the c-th sequence are stored as the preceding-context information, and the features of all frame sequences following it are stored as the following-context information.
(3b4) The decoder in the prediction module decodes the feature information obtained in step (3b2) together with the preceding-context and following-context information obtained in step (3b3), obtaining the prediction result of the c-th frame sequence at the t-th iteration. The process is as follows: by means of the transposed convolutional neural network, the decoder transposes and decodes the tensor formed by the preceding-context information, the following-context information and the feature information of the c-th frame sequence, yielding the prediction result at the t-th iteration. Because the decoder in this example simultaneously uses the feature information extracted from the c-th frame sequence by the spatial encoder and the feature information extracted from the other frame sequences by the motion matching encoder, the prediction results are more diverse and the model is more intelligent.
(3c) The background suppression constraint module binarizes the prediction result x̂_n and the normal behavior frame image x_n of the c-th normal behavior frame sequence, obtaining the binary image Binary(x̂_n) of the prediction result at iteration t and the binary image Binary(x_n) of the n-th normal behavior frame image. The binarization performed by the background suppression constraint module sets every pixel value of a frame image that is not 0 to 1.
Because the foreground object and the background object both move continuously in the video, and the change of the pixel value is continuous, when the moving object passes through a certain area, the pixel value of the area changes, and the fluctuation of the pixel value is also taken as potential feature extraction in the process of extracting the feature by the algorithm, thereby causing false detection.
In this example, the binarization process would be to normally-behave frame images
Figure BDA0003547448580000111
And predicting the result
Figure BDA0003547448580000112
All the pixel values which are not 0 in the background image are changed into 1, and then the problem that the pixel value of a moving target passing area is not 0 caused by target motion is solved through the difference frame of the two pixel values, so that dynamic background information is suppressed, and the accuracy of detection is improved.
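A sketch of the binarization and difference-frame step with NumPy (hypothetical helper names; the rule "every non-zero pixel becomes 1" follows the description):

```python
import numpy as np

def binarize(frame):
    # Every pixel value that is not 0 becomes 1.
    return (frame != 0).astype(np.uint8)

def binary_difference(pred_frame, real_frame):
    # 1 exactly where one frame is non-zero and the other is zero:
    # regions a moving target has merely passed through cancel out.
    return np.abs(binarize(pred_frame).astype(np.int16)
                  - binarize(real_frame).astype(np.int16))
```

L_BGS is then the 1-norm of this difference image.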
(3d) Using the background suppression loss function L_BGS, computing the background suppression loss value L_BGS of H_t from Binary(x̂_n) and Binary(x_n); and using the background constraint loss function L_restrain, computing the background constraint loss value L_restrain of H_t from L_BGS, L_2 and L_1;
(3e) Using back-propagation, computing the gradients of H_t's network parameters from L_restrain; then updating the network parameters θ_G1_t, θ_G2_t, θ_G3_t and θ_G4_t by stochastic gradient descent using these gradients, obtaining this iteration's unsupervised abnormal behavior detection network model H_t;
(3f) Judging whether t ≥ T; if so, the trained unsupervised abnormal behavior detection network model H* is obtained; otherwise, letting t = t + 1 and H = H_t, and returning to step (3b);
The stochastic gradient descent algorithm updates H_t's feature extraction network parameters θ_G1_t, memory convolutional neural network parameters θ_G2_t, transposed convolutional neural network parameters θ_G3_t and three-dimensional convolutional neural network parameters θ_G4_t through H_t's network parameter gradients, with the update formulas:

g_t = ∇_θ f_t(θ_{t−1})
m_t = β_1·m_{t−1} + (1 − β_1)·g_t
v_t = β_2·v_{t−1} + (1 − β_2)·g_t²
m̂_t = m_t / (1 − β_1^t)
v̂_t = v_t / (1 − β_2^t)
θ_t = θ_{t−1} − α·m̂_t / (√v̂_t + ε)

where g_t is the gradient at iteration number t; θ_Gi_t (i = 1, 2, 3, 4) are the updated feature extraction network, memory convolutional neural network, transposed convolutional neural network and three-dimensional convolutional neural network parameters; f_ti(θ) (i = 1, 2, 3, 4) is the objective function of parameter θ_Gi_t; β_1 and β_2 are the exponential decay rates of the first and second moments; m_ti (i = 1, 2, 3, 4) are the first-moment estimates of H_t's network parameter gradients and v_ti (i = 1, 2, 3, 4) the second-moment estimates; m̂_ti and v̂_ti are the bias-corrected m_ti and v_ti; β_i^t is the t-th power of β_i; α_i (i = 1, 2, 3, 4) are the learning rates; and ε_i (i = 1, 2, 3, 4) are constants added to maintain numerical stability.
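These update formulas are the Adam variant of stochastic gradient descent; in PyTorch the same update is applied by torch.optim.Adam. The sketch below reuses the module and loss sketches given earlier and replaces the full forward pass of steps (3b1)-(3b4) with a toy one (hyperparameter values are illustrative):

```python
import itertools
import torch

# The four parameter groups theta_G1..theta_G4 are optimized jointly.
params = itertools.chain(spatial_encoder.parameters(),          # theta_G1
                         conv_lstm.parameters(),                # theta_G2
                         decoder.parameters(),                  # theta_G3
                         motion_matching_encoder.parameters())  # theta_G4
optimizer = torch.optim.Adam(params, lr=1e-4, betas=(0.9, 0.999), eps=1e-8)

# Toy forward pass standing in for steps (3b1)-(3b4).
frames = torch.randn(8, 1, 256, 256)   # background-suppressed frames
target = torch.randn(8, 1, 128, 128)   # frames to be predicted
pred = decoder(torch.cat([spatial_encoder(frames),
                          spatial_encoder(frames)], dim=1))

loss = restrain_loss(pred, target)  # L_restrain from the loss sketch above
optimizer.zero_grad()
loss.backward()                     # back-propagation yields the gradients g_t
optimizer.step()                    # applies the moment-based update shown above
```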
step 4), obtaining an abnormal behavior detection result:
(4a) Taking the c-th abnormal behavior frame sequence of the test sample set B_test as the input of the trained unsupervised abnormal behavior detection network model H* and forward-propagating it to obtain its predicted frame image ŷ;
(4b) Using the anomaly score function score, computing F = score(ŷ, y) from the predicted frame image ŷ and the real frame image y, and judging whether F and the preset anomaly score detection threshold I satisfy F ≥ I; if so, the frame sequence contains abnormal behavior, otherwise it does not, where score is computed from the difference between the predicted frame image and the real frame image.
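The exact score function is defined by the formulas above, which are not reproduced in this text; as an illustrative stand-in, prediction-based anomaly detectors commonly score a frame by a normalized PSNR-style prediction error, sketched below (the score form and threshold value are assumptions, not the patent's exact definition):

```python
import numpy as np

def psnr(pred, real, peak=255.0):
    mse = np.mean((pred.astype(np.float32) - real.astype(np.float32)) ** 2)
    return 10.0 * np.log10(peak ** 2 / (mse + 1e-8))

def anomaly_score(pred, real):
    # Higher prediction error (lower PSNR) -> higher anomaly score.
    # Illustrative stand-in, not the patent's exact score function.
    return 1.0 / (1.0 + psnr(pred, real))

pred = np.random.randint(0, 256, (128, 128), dtype=np.uint8)  # toy frames
real = np.random.randint(0, 256, (128, 128), dtype=np.uint8)
F = anomaly_score(pred, real)
I = 0.02  # preset anomaly score detection threshold (illustrative)
print("abnormal" if F >= I else "normal")
```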
the effect of the present invention will be further explained with reference to the following experiments:
1. the experimental conditions are as follows:
The hardware platform of the experiments of the invention is: two NVIDIA GeForce RTX 2080 Ti GPUs.
The software platform of the experiments of the invention is: Ubuntu 16 operating system, PyTorch 1.7 framework, Python 3.8.
The data set used for the experiment was the ShanghaiTech data set, which had a total of 437 videos, each with different lighting conditions and camera angles.
2. Analysis of experimental contents and results thereof:
(1) evaluation index
The main evaluation index in the field of video-monitoring abnormal behavior detection is the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve. The ROC curve takes the false positive rate as the abscissa and the true positive rate as the ordinate; the false positive rate is the probability that a negative sample is predicted as positive, and the true positive rate is the probability that a positive sample is predicted as positive. The closer the ROC curve is to the upper-left corner, the larger the AUC value and the better the performance of the algorithm model. For the abnormal behavior detection task, AUC values are calculated from image-level anomaly scores.
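With image-level anomaly scores and frame labels in hand, the AUC can be computed with scikit-learn (a minimal sketch with toy values):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

labels = np.array([0, 0, 1, 1, 0])             # ground-truth frame labels (toy example)
scores = np.array([0.1, 0.2, 0.8, 0.7, 0.3])   # per-frame anomaly scores
print("AUC:", roc_auc_score(labels, scores))
```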
(3) Results and analysis of the experiments
This experiment verifies the advantage of the proposed method over other existing abnormal behavior detection methods in detection accuracy. In the experiment, the various abnormal behavior detection methods are trained and tested on the ShanghaiTech data set, and the evaluation index AUC on this data set is obtained.
Table 1. Experimental results of different algorithms on the ShanghaiTech data set

Method          AUC
Conv-AE         60.9%
StackedRNN      68.0%
Liu et al.      72.8%
VEC             74.8%
HF2-VAD         76.2%
The invention   76.5%
As can be seen from the experimental results in Table 1, the invention achieves higher accuracy than the prior art.
In conclusion, compared with the prior art, the invention achieves a higher detection accuracy for abnormal behavior and has important practical significance. While the invention has been shown and described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes in form and detail may be made without departing from the spirit and scope of the invention.

Claims (4)

1. An unsupervised abnormal behavior detection method based on background suppression is characterized by comprising the following steps:
(1) acquiring a training sample set and a testing sample set:
(1a) randomly selecting M sidewalk monitoring videos and decomposing them to obtain a set of M frame sequences S_v = {S_v^1, ..., S_v^M}, where S_v^m = {v_1, ..., v_{K_m}} denotes the m-th frame sequence containing K_m frame images, v_k denotes the k-th frame image of S_v^m, M ≥ 200, and K_m ≥ 100;
(1b) screening out, from each frame sequence S_v^m of the set S_v, the N_m frame images that contain only pedestrian walking events to form a normal behavior frame sequence, the normal behavior frame sequences of all M frame sequences forming the training sample set B_train; the P_m frame images remaining in S_v^m forming an abnormal behavior frame sequence, and all abnormal behavior frame sequences forming the test sample set B_test, where N_m ≥ P_m and P_m = K_m − N_m;
(2) Constructing an unsupervised abnormal behavior detection network model H:
(2a) constructing an unsupervised abnormal behavior detection network model H comprising a background suppression module, a prediction module and a background suppression constraint module connected in sequence, the output end of the background suppression module also being connected to a context memory module; wherein:
the prediction module comprises a spatial encoder, a convolutional long short-term memory (ConvLSTM) module and a decoder connected in sequence; the spatial encoder adopts a feature extraction network comprising a plurality of two-dimensional convolution layers and a plurality of activation function layers; the ConvLSTM module adopts a memory convolutional neural network comprising a plurality of two-dimensional convolution layers, a plurality of tensor decomposition layers and a plurality of activation function layers; the decoder adopts a transposed convolutional neural network comprising a plurality of two-dimensional transposed convolution layers and a plurality of activation function layers;
the context memory module comprises a motion matching encoder and a memory module connected in sequence; the motion matching encoder adopts a three-dimensional convolutional neural network comprising a plurality of three-dimensional convolution layers, a plurality of activation function layers, a plurality of three-dimensional max pooling layers and one three-dimensional average pooling layer;
the output end of the memory module in the context memory module is connected to the input end of the decoder in the prediction module;
(2b) defining the background suppression loss function L_BGS of the background suppression constraint module, the background constraint loss function L_restrain, the least square error L_2 and the least absolute deviation L_1:

L_BGS = ||Binary(x̂_n) − Binary(x_n)||_1

L_2 = ||x̂_n − x_n||_2^2

L_1 = ||x̂_n − x_n||_1

L_restrain = L_BGS + L_2 + L_1

where ||·||_1 denotes the 1-norm, ||·||_2 the 2-norm, Binary(·) denotes binarization, x̂_n denotes the prediction result of x_n, and x_n denotes the n-th frame image of the normal behavior frame sequence;
(3) carrying out iterative training on the unsupervised abnormal behavior detection network model H:
(3a) initializing the iteration number t and the maximum iteration number T, T ≥ 80; denoting the parameters of the feature extraction network at the t-th iteration by θ_G1_t, the memory convolutional neural network parameters by θ_G2_t, the transposed convolutional neural network parameters by θ_G3_t and the three-dimensional convolutional neural network parameters by θ_G4_t; and letting t = 1;
(3b) taking the training sample set B_train as the input of the unsupervised abnormal behavior detection network model H and obtaining, at the t-th iteration, the prediction results x̂ of the frame sequences:
(3b1) the background suppression module suppresses the background information of each normal behavior frame image x_n in each normal behavior frame sequence of the training sample set B_train, all background-suppressed frame images forming a frame image sequence;
(3b2) the spatial encoder in the prediction module extracts features from each frame image of the background-suppressed c-th frame sequence, and the ConvLSTM module decomposes the feature tensor composed of all extracted features to obtain the feature information of that sequence and stores it, c ∈ [2, M−1];
(3b3) the context memory module extracts features from each frame image of the M−1 normal behavior frame sequences other than the c-th one; the features of all frame images preceding the c-th sequence form the preceding-context information and are stored, while the features of all frame images following it form the following-context information and are stored;
(3b4) the decoder in the prediction module decodes the feature information obtained in step (3b2) together with the preceding-context and following-context information obtained in step (3b3), obtaining the prediction result of the c-th frame sequence at the t-th iteration;
(3c) the background suppression constraint module binarizes the prediction result x̂_n and the normal behavior frame image x_n of the c-th normal behavior frame sequence, obtaining the binary image Binary(x̂_n) of the prediction result at iteration t and the binary image Binary(x_n) of the n-th normal behavior frame image;
(3d) using the background suppression loss function L_BGS, computing the background suppression loss value L_BGS of H_t from Binary(x̂_n) and Binary(x_n); and using the background constraint loss function L_restrain, computing the background constraint loss value L_restrain of H_t from L_BGS, L_2 and L_1;
(3e) using back-propagation, computing the gradients of H_t's network parameters from L_restrain; then updating the network parameters θ_G1_t, θ_G2_t, θ_G3_t and θ_G4_t by stochastic gradient descent using these gradients, obtaining this iteration's unsupervised abnormal behavior detection network model H_t;
(3f) judging whether t ≥ T; if so, the trained unsupervised abnormal behavior detection network model H* is obtained; otherwise, letting t = t + 1 and H = H_t, and returning to step (3b);
(4) acquiring an abnormal behavior detection result:
(4a) taking the c-th abnormal behavior frame sequence of the test sample set B_test as the input of the trained unsupervised abnormal behavior detection network model H* and forward-propagating it to obtain its predicted frame image ŷ;
(4b) using the anomaly score function score, computing F = score(ŷ, y) from the predicted frame image ŷ and the real frame image y, and judging whether F and the preset anomaly score detection threshold I satisfy F ≥ I; if so, the frame sequence contains abnormal behavior, otherwise it does not, where score is computed from the difference between the predicted frame image and the real frame image.
2. the background suppression-based unsupervised abnormal behavior detection method according to claim 1, wherein the unsupervised abnormal behavior detection network model H in step (2a) is a network model H in which:
the spatial encoder contains 4 two-dimensional convolution layers and 4 activation function layers, with the structure: first two-dimensional convolution layer → first activation function layer → second two-dimensional convolution layer → second activation function layer → third two-dimensional convolution layer → third activation function layer → fourth two-dimensional convolution layer → fourth activation function layer; the first two-dimensional convolution layer has 1 input channel, 64 output channels and stride 2; the second has 64 input channels, 64 output channels and stride 1; the third has 64 input channels, 128 output channels and stride 2; the fourth has 128 input channels, 128 output channels and stride 1; all 4 two-dimensional convolution layers use 3×3 convolution kernels, and all 4 activation function layers use the ELU function;
the ConvLSTM module contains 2 two-dimensional convolution layers, 2 tensor decomposition layers and 3 activation function layers, with the structure: first two-dimensional convolution layer → second two-dimensional convolution layer → first tensor decomposition layer → second tensor decomposition layer → first activation function layer → second activation function layer → third activation function layer; the two two-dimensional convolution layers are identical, with 128 input channels and 128 output channels; the 3 activation function layers all use the sigmoid function;
the decoder contains 4 two-dimensional transposed convolution layers and 3 activation function layers, with the structure: first two-dimensional transposed convolution layer → first activation function layer → second two-dimensional transposed convolution layer → second activation function layer → third two-dimensional transposed convolution layer → third activation function layer → fourth two-dimensional transposed convolution layer; the first two-dimensional transposed convolution layer has 256 input channels, 128 output channels and stride 1; the second has 128 input channels, 64 output channels and stride 2; the third has 64 input channels, 64 output channels and stride 1; the fourth has 64 input channels, 1 output channel and stride 1; all 4 transposed convolution layers use 3×3 convolution kernels, and the 3 activation function layers all use the ELU function;
the motion matching encoder contains 6 three-dimensional convolution layers, 6 activation function layers, 4 three-dimensional max pooling layers and 1 three-dimensional average pooling layer, with the structure: first three-dimensional convolution layer → first activation function layer → first three-dimensional max pooling layer → second three-dimensional convolution layer → second activation function layer → second three-dimensional max pooling layer → third three-dimensional convolution layer → third activation function layer → fourth three-dimensional convolution layer → fourth activation function layer → third three-dimensional max pooling layer → fifth three-dimensional convolution layer → fifth activation function layer → sixth three-dimensional convolution layer → sixth activation function layer → fourth three-dimensional max pooling layer → three-dimensional average pooling layer; the first three-dimensional convolution layer has 1 input channel and 64 output channels; the second 64 and 128; the third 128 and 256; the fourth 256 and 256; the fifth 256 and 512; the sixth 512 and 512; all strides are 1, and all 6 layers use 3×3 convolution kernels; the first three-dimensional max pooling layer has pooling kernel size 1×2 and stride 1×2; the second, third and fourth have kernel size 2×2 and stride 2×2; the three-dimensional average pooling layer has kernel size 1×2; all 6 activation function layers use the ReLU function.
3. The background suppression-based unsupervised abnormal behavior detection method according to claim 1, wherein the background suppression module in step (3b1) suppresses the background information of each normal behavior frame image in each normal behavior frame sequence of the training sample set B_train, implemented as follows:

the background suppression module performs gamma correction on each normal behavior frame image in each normal behavior frame sequence of the training sample set B_train, performs Gaussian filtering on the gamma-corrected frame image, and performs Laplacian sharpening on the Gaussian-filtered frame image, obtaining a frame image with suppressed background information.
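A minimal sketch of the three-step background suppression described in claim 3, using OpenCV; the gamma value, Gaussian kernel size and sigma are illustrative assumptions, since the claim does not fix them:

```python
import cv2
import numpy as np

def suppress_background(frame: np.ndarray, gamma: float = 0.5,
                        ksize: int = 5, sigma: float = 1.0) -> np.ndarray:
    """Background suppression per claim 3: gamma correction, then Gaussian
    filtering, then Laplacian sharpening. Parameter values are assumptions."""
    # 1) gamma correction on the normalized grayscale frame
    f = frame.astype(np.float32) / 255.0
    f = np.power(f, gamma)
    # 2) Gaussian low-pass filtering to smooth noise
    f = cv2.GaussianBlur(f, (ksize, ksize), sigma)
    # 3) Laplacian sharpening: subtracting the Laplacian enhances edges
    lap = cv2.Laplacian(f, cv2.CV_32F)
    f = np.clip(f - lap, 0.0, 1.0)
    return (f * 255.0).astype(np.uint8)

# usage on one grayscale frame of a normal-behavior sequence:
# frame = cv2.imread("frame_0001.png", cv2.IMREAD_GRAYSCALE)
# suppressed = suppress_background(frame)
```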
4. The background suppression-based unsupervised abnormal behavior detection method according to claim 1, characterized in that: in step (3e), the network parameters θ_{G1_t}, θ_{G2_t}, θ_{G3_t}, θ_{G4_t} are updated by stochastic gradient descent using the gradients of the network parameters of H_t; the update formulas are:

θ_t = θ_{t-1} - α·m̂_t / (√(v̂_t) + ε)
g_t = ∇_θ f_t(θ_{t-1})
m_t = β_1·m_{t-1} + (1 - β_1)·g_t
v_t = β_2·v_{t-1} + (1 - β_2)·g_t^2
m̂_t = m_t / (1 - β_1^t)
v̂_t = v_t / (1 - β_2^t)

wherein g_t is the gradient at iteration number t; θ_{G1_t}, θ_{G2_t}, θ_{G3_t}, θ_{G4_t} are, respectively, the updated feature extraction network parameters, memory convolutional neural network parameters, transposed convolutional neural network parameters and three-dimensional convolutional neural network parameters; {f_{ti}(θ) | i = 1,2,3,4} is the objective function of parameter θ_{Gi_t}; β_1 and β_2 are the exponential decay rates of the first and second moments, respectively; {m_{ti} | i = 1,2,3,4} are the first-moment estimates of the H_t network parameter gradients, and {v_{ti} | i = 1,2,3,4} are the second-moment estimates; m̂_{ti} and v̂_{ti} are the bias corrections of {m_{ti} | i = 1,2,3,4} and {v_{ti} | i = 1,2,3,4}; β_i^t is β_i raised to the power t; {α_i | i = 1,2,3,4} are the learning rates; and {ε_i | i = 1,2,3,4} are constants added to maintain numerical stability.
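The update formulas in claim 4 are the Adam variant of stochastic gradient descent; a minimal NumPy sketch of one update step follows (hyperparameter defaults are the common Adam values, not taken from the patent):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One update per claim 4, applied independently to each parameter
    group theta_Gi_t, i in {1,2,3,4}."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate m_t
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second-moment estimate v_t
    m_hat = m / (1.0 - beta1 ** t)                # bias corrections
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# usage: iterate t = 1, 2, ... for one parameter group with a toy objective
theta, m, v = np.zeros(4), np.zeros(4), np.zeros(4)
for t in range(1, 11):
    grad = 2.0 * theta - 1.0   # gradient of sum((theta - 0.5)^2)
    theta, m, v = adam_step(theta, grad, m, v, t)
```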
CN202210252961.6A 2022-03-15 2022-03-15 Unsupervised abnormal behavior detection method based on background suppression Active CN114612936B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210252961.6A CN114612936B (en) 2022-03-15 2022-03-15 Unsupervised abnormal behavior detection method based on background suppression

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210252961.6A CN114612936B (en) 2022-03-15 2022-03-15 Unsupervised abnormal behavior detection method based on background suppression

Publications (2)

Publication Number Publication Date
CN114612936A true CN114612936A (en) 2022-06-10
CN114612936B CN114612936B (en) 2024-08-23

Family

ID=81862820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210252961.6A Active CN114612936B (en) 2022-03-15 2022-03-15 Non-supervision abnormal behavior detection method based on background suppression

Country Status (1)

Country Link
CN (1) CN114612936B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170103264A1 (en) * 2014-06-24 2017-04-13 Sportlogiq Inc. System and Method for Visual Event Description and Event Analysis
CN111832516A (en) * 2020-07-22 2020-10-27 西安电子科技大学 Video behavior identification method based on unsupervised video representation learning
CN113032778A (en) * 2021-03-02 2021-06-25 四川大学 Semi-supervised network abnormal behavior detection method based on behavior feature coding

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANITHA RAMCHANDRAN et al.: "Unsupervised deep learning system for local anomaly event detection in crowded scenes", Multimedia Tools and Applications, 12 May 2019 (2019-05-12) *
LI Ding: "Research and Application of Unsupervised Abnormal Event Detection Algorithms for Surveillance Video", Wanfang Data, 6 July 2023 (2023-07-06) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024055948A1 (en) * 2022-09-14 2024-03-21 北京数慧时空信息技术有限公司 Improved unsupervised remote-sensing image abnormality detection method

Also Published As

Publication number Publication date
CN114612936B (en) 2024-08-23

Similar Documents

Publication Publication Date Title
CN108133188B (en) Behavior identification method based on motion history image and convolutional neural network
CN108765506B (en) Layer-by-layer network binarization-based compression method
CN104268594B (en) A kind of video accident detection method and device
CN111861925B (en) Image rain removing method based on attention mechanism and door control circulation unit
CN110120064B (en) Depth-related target tracking algorithm based on mutual reinforcement and multi-attention mechanism learning
CN112597815A (en) Synthetic aperture radar image ship detection method based on Group-G0 model
CN114882434A (en) Unsupervised abnormal behavior detection method based on background suppression
CN106529419A (en) Automatic detection method for significant stack type polymerization object in video
CN113378775B (en) Video shadow detection and elimination method based on deep learning
CN107424175B (en) Target tracking method combined with space-time context information
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
Wang et al. Fast infrared maritime target detection: Binarization via histogram curve transformation
CN111008608B (en) Night vehicle detection method based on deep learning
CN110929635A (en) False face video detection method and system based on face cross-over ratio under trust mechanism
Cai et al. A real-time smoke detection model based on YOLO-smoke algorithm
CN111368634A (en) Human head detection method, system and storage medium based on neural network
CN112634171B (en) Image defogging method and storage medium based on Bayesian convolutional neural network
CN114612936A (en) Unsupervised abnormal behavior detection method based on background suppression
CN111144220B (en) Personnel detection method, device, equipment and medium suitable for big data
CN116433909A (en) Similarity weighted multi-teacher network model-based semi-supervised image semantic segmentation method
CN116189096A (en) Double-path crowd counting method of multi-scale attention mechanism
CN111079572A (en) Forest smoke and fire detection method based on video understanding, storage medium and equipment
CN114462490A (en) Retrieval method, retrieval device, electronic device and storage medium of image object
CN105872859A (en) Video compression method based on moving target trajectory extraction of object
CN115375966A (en) Image countermeasure sample generation method and system based on joint loss function

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant