KR102134902B1

KR102134902B1 - Frameworking method for violence detection using spatiotemporal characteristic analysis of shading image based on deep learning

Info

Publication number: KR102134902B1
Application number: KR1020180140481A
Authority: KR
Inventors: 방승온
Original assignee: (주)지와이네트웍스
Priority date: 2018-11-15
Filing date: 2018-11-15
Publication date: 2020-07-17
Also published as: JP2020087400A; KR20200057834A; JP6668514B1

Abstract

The present invention relates to a violence detection framework that uses deep-learning-based spatiotemporal analysis of shading (grayscale) images, and aims to improve the capability and accuracy of detecting violence in video.
To this end, in a violence detection framework that detects violence in an input video, composed of frames supplied by a video camera or a video file, by detecting feature points of violence, the invention includes: a first step of dividing the real-time input video into individual frame images; a second step of extracting a 2D Y-frame grayscale image from each separated frame image by excluding red (R), green (G), and blue (B); a third step of sequentially accumulating a plurality of the extracted 2D Y-frame grayscale images to convert them into a 3D Y-frame grayscale volume; and a fourth step of extracting the frames of evenly spaced layers from the 3D volume, accumulating them again, performing image convolution, and deriving the desired detection scene using a 3*3*3 filter.
The framework thereby produces a lightweight network and images optimized for time and space, applies them to the algorithm, and causes specific layers to continuously retain and re-learn the feature points of violence during image convolution. This improves violence detection capability and accuracy, permits analysis regardless of the length of the analyzed frame sequence, and enables analysis of continuous actions.

Description

Framework for violence detection using spatiotemporal characteristic analysis of shading image based on deep learning

The present invention relates to a violence detection frameworking method that uses deep-learning-based spatiotemporal analysis of shading images. More specifically, it relates to a method that improves violence detection capability and accuracy by addressing the tendency of convolution over violent video to lose feature points encountered earlier, that permits analysis regardless of the length of the analyzed frame sequence by using a small filter, and that enables learning over consecutive frames, and hence analysis of continuous actions, by moving the filter along the time axis.

In general, when violence, assault, or kidnapping occurs in a residential area, building, road, or public facility, and no one is nearby or no report is made because of indifference, it becomes difficult to collect the information needed to determine the cause or severity of the incident. As a preventive and security measure, multiple CCTV cameras are therefore installed in crime-prone areas, dark alleys, remote areas, and the like, and their feeds are collected at a control center and watched on a monitoring display that gathers the many CCTV screens.

However, a control center typically has dozens of CCTV monitoring screens but only a few observers, and incidents such as violence, assault, and kidnapping occur in an instant or within a relatively short time, so it is not easy for a small number of observers to catch them on the monitoring screens. To compensate for this, violence detection systems based on video analysis have recently been developed.

In this regard, existing violence detection frameworks include MoSIFT+HIK (Violence detection in video using computer vision techniques), VIF (Violent flows), MoSIFT+KDE+Sparse Coding (Violent video detection based on mosift feature and sparse coding), Gracia et al. (Fast fight detection), Substantial Derivative (Violence detection in crowded scenes using substantial derivative), Bilinski et al. (Human violence recognition and detection in surveillance videos), MoIWLD (Discriminative dictionary learning with motion weber local descriptor for violence detection), ViF+OViF (Violence detection using oriented violent flows), and Three streams + LSTM (Multi-stream deep networks for person to person violence detection in videos).

(1) MoSIFT+HIK: Violence detection in video using computer vision techniques

A. Method: As illustrated in FIG. 1A, video characteristics are represented using local features that support spatiotemporal interpretation together with BoW, and an SVM (Support Vector Machine) decides whether violence is present. Space-time interest points (STIP): because the results of Harris corner detection are analyzed spatiotemporally, the spatiotemporal changes of corner points can be analyzed.

B. Motion SIFT (MoSIFT): As illustrated in FIG. 1B, standard SIFT (scale-invariant feature transform) + optical-flow-based local motion. Combining optical-flow information with SIFT-based local features makes it possible to analyze how the local features change.

In SIFT, mostly corners are extracted as interest points, and the region around each corner is represented by a descriptor (histogram of oriented gradients).

C. Bag-of-Words (BoW): As illustrated in FIG. 1C, this method describes a video by a histogram of visual words (local descriptors built from a combination of specific features), and uses the visual-word-based features as the information for training and classification.
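
The Bag-of-Words representation described above can be sketched as follows. This is a minimal illustration, not the cited work's implementation: the codebook, the 2-D descriptors, and the function name `bow_histogram` are all hypothetical, and nearest-neighbor assignment by Euclidean distance is an assumption.

```python
from math import dist

def bow_histogram(descriptors, codebook):
    """Assign each local descriptor to its nearest visual word and
    return the normalized histogram of word counts."""
    counts = [0] * len(codebook)
    for d in descriptors:
        nearest = min(range(len(codebook)), key=lambda k: dist(d, codebook[k]))
        counts[nearest] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(codebook)

# Hypothetical codebook of three visual words and four local descriptors.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (4.8, 5.1)]
hist = bow_histogram(descriptors, codebook)
print(hist)  # [0.25, 0.5, 0.25]
```

The resulting histogram is the fixed-length vector that a classifier such as an SVM would consume.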

(2) VIF: Violent flows: real-time detection of violent crowd behavior.

A. Method: Changes in optical flow vector magnitudes are classified as violent or non-violent using an SVM.

B. ViF: Represents how the optical flow magnitude changes over time; the value of the magnitude itself is not considered.

C. Classification: The video is represented using ViF and ViF words, and an SVM decides whether violence is present.

(3) MoSIFT+KDE+Sparse Coding: Violent video detection based on mosift feature and sparse coding

A. Method: As illustrated in FIG. 1D, MoSIFT features are selected using KDE, and a feature vector is generated through sparse coding to decide whether violence is present.

B. KDE (Kernel Density Estimation): A method that resolves the discontinuity of a histogram's distribution and the dependence of the distribution on bin size and range. A kernel function is generated for each observed data point, and all kernels are summed to represent the overall data distribution.
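
As a rough illustration of the kernel-summing idea, the sketch below builds a one-dimensional density estimate. The Gaussian kernel, the bandwidth value, and the sample data are assumptions chosen for the example; the text does not say which kernel the cited work uses.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a density function formed by placing one Gaussian kernel
    on each sample and averaging the kernels."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

# Hypothetical 1-D observations: two close together, one far away.
density = gaussian_kde([1.0, 1.2, 3.0], bandwidth=0.5)
print(density(1.1) > density(2.0))  # True: density is higher near the cluster
```

Unlike a histogram, the estimate is smooth and does not depend on where bin edges happen to fall.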

(4) Gracia et al.: Fast fight detection

A. Method: As illustrated in FIG. 1E, motion blobs obtained from the differences between frames are analyzed to distinguish violence from non-violence.

B. By analyzing the shape and position of motion blobs (and the relations between blobs), the difference between global motion and local motion can be analyzed.
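
A minimal sketch of the frame-differencing step behind motion blobs is shown below. It is an assumption-laden simplification: frames are small grayscale grids given as nested lists, the threshold is arbitrary, and the whole mask is treated as one blob rather than being split into connected components as a real system would do.

```python
def motion_mask(prev_frame, curr_frame, threshold):
    """Mark pixels whose absolute intensity change exceeds the threshold."""
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]

def blob_area_and_centroid(mask):
    """Area and (row, col) centroid of all moving pixels, or None if static."""
    points = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not points:
        return 0, None
    area = len(points)
    return area, (sum(r for r, _ in points) / area, sum(c for _, c in points) / area)

# Hypothetical 3x4 grayscale frames: two pixels brighten between frames.
prev_frame = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
curr_frame = [[0, 0, 0, 0], [0, 200, 200, 0], [0, 0, 0, 0]]
mask = motion_mask(prev_frame, curr_frame, threshold=50)
area, centroid = blob_area_and_centroid(mask)
print(area, centroid)  # 2 (1.0, 1.5)
```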

(5) Substantial Derivative: Violence detection in crowded scenes using substantial derivative.

A. Method: As illustrated in FIG. 1F, the spatiotemporal characteristics of the optical flow between frames are extracted (substantial derivative) and represented as BoW to decide whether violence is present.

(6) Bilinski et al.: Human violence recognition and detection in surveillance videos.

A. Method: Violence is detected by incorporating spatiotemporal information into an improved Fisher filter.

(7) MoIWLD: Discriminative dictionary learning with motion weber local descriptor for violence detection

(8) ViF+OViF: Violence detection using oriented violent flows

A. Method: As illustrated in FIG. 1G, the video is represented, and violence detected, using OViF, which applies the ViF concept to the optical flow direction.

(9) Three streams + LSTM: Multi-stream deep networks for person to person violence detection in videos

A. Method: As illustrated in FIG. 1H, analyzing the actions of a single person (e.g., walking, stretching an arm), as conventional approaches do, cannot capture the complex appearance that arises when an assault occurs. To solve this, a CNN is used to learn the person-to-person appearance itself and detect violence.

However, because violence by its nature usually involves at least two people entangled in complex movement, the conventional techniques above have difficulty detecting violence in such entangled images. Moreover, none of these conventional methods takes the time-lag differences between actions into account, and a detection system that ignores them inevitably suffers degraded performance.

KR 10-1541272 B1, registered 2015.07.28.
KR 10-1552344 B1, registered 2015.09.04.
KR 10-1651410 B1, registered 2016.08.22.

Accordingly, the present invention was devised to solve the above problems. The technical problem it addresses is to provide a violence detection frameworking method, using deep-learning-based spatiotemporal analysis of shading images, that extracts from the real-time input video the luminance component (Y) image, a grayscale image excluding the chrominance components (U, V); produces a lightweight network and images optimized for time and space and applies them to the algorithm; and causes specific layers to continuously retain and re-learn the feature points of violence during image convolution. This addresses the tendency of convolution over violent video to lose feature points encountered earlier, and thereby improves violence detection capability and accuracy. The use of a small filter permits analysis regardless of the length of the analyzed frame sequence, and moving the filter along the time axis enables learning over consecutive frames and hence analysis of continuous actions.

One embodiment of the present invention for achieving the above object is a violence detection frameworking method using deep-learning-based spatiotemporal analysis of shading images, which detects violence in an input video, composed of frames supplied by a video camera or a video file, by detecting feature points of violence, the method comprising: extracting a two-dimensional (2D) luminance component (Y) image by excluding the chrominance components (U, V) from the image of one frame of the input video; sequentially accumulating the 2D Y images in three dimensions (3D) and extracting only evenly spaced frames from them to obtain a 3D Y image group; and performing image convolution on the 3D Y image group and deriving a violence detection scene using a 3*3*3 filter.

According to the present invention, a Y-frame grayscale image excluding red (R), green (G), and blue (B) is extracted from the real-time input video, a lightweight network and images optimized for time and space are produced and applied to the algorithm, and a re-learning method is used during image convolution so that specific layers continuously retain and re-learn the feature points of violence. This addresses the tendency of convolution over violent video to lose feature points encountered earlier, and thus improves the capability to detect violence in video.

In addition, because the present invention uses a filter (3 x 3 x 3 kernel) smaller than that of existing frameworks (3 x 3 x F), analysis is possible regardless of the length of the analyzed frame sequence.

In addition, by moving the filter along the time axis, the present invention can learn over more consecutive frames than existing frameworks, which enables analysis of continuous actions.
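
To make the filter-size comparison concrete, the sketch below counts the weights of a single-channel 3 x 3 x F kernel versus a 3 x 3 x 3 kernel, and the number of positions each can occupy along the time axis. Stride 1 and no temporal padding are assumptions (the text does not state them), F = 10 is an arbitrary example value, and the function names are illustrative only.

```python
def kernel_weights(height, width, depth):
    """Number of weights in one single-channel 3D kernel."""
    return height * width * depth

def temporal_positions(num_frames, depth):
    """Positions a kernel of temporal size `depth` can take along the
    time axis, assuming stride 1 and no padding."""
    return num_frames - depth + 1

F = 10  # assumed number of stacked frames
full_depth = (kernel_weights(3, 3, F), temporal_positions(F, F))
small = (kernel_weights(3, 3, 3), temporal_positions(F, 3))
print(full_depth)  # (90, 1): tied to F, applied once
print(small)       # (27, 8): independent of F, slides along time
```

Under these assumptions the 3 x 3 x 3 kernel has a weight count that does not grow with F, which is one way to read the claim that analysis does not depend on the length of the frame sequence.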

In addition, by applying residual networks to 3D convolution, the present invention can improve training time and detection accuracy.

FIGS. 1A to 1H are reference views illustrating various conventional violence detection frameworks.
FIG. 2 is a block diagram illustrating the violence detection frameworking method using deep-learning-based spatiotemporal analysis of shading images according to the present invention.
FIG. 3 is a detailed diagram illustrating the re-learning process during image convolution in the violence detection frameworking method according to the present invention.

Hereinafter, the configuration, operation, and resulting effects of the violence detection frameworking method using deep-learning-based spatiotemporal analysis of shading images according to a preferred embodiment of the present invention are described in detail with reference to the accompanying drawings.

The terms and words used in this specification and the claims are not to be construed as limited to their ordinary or dictionary meanings. They must be interpreted with the meanings and concepts consistent with the technical idea of the present invention, on the principle that an inventor may appropriately define the concept of a term in order to describe the invention in the best way. Accordingly, the embodiments described in this specification and the configurations shown in the drawings are merely the most preferred embodiment of the present invention, and it should be understood that various equivalents and modifications capable of replacing them may exist at the time of filing.

FIG. 2 is a block diagram illustrating the violence detection frameworking method using deep-learning-based spatiotemporal analysis of shading images according to the present invention, and FIG. 3 is a detailed diagram illustrating the re-learning process during image convolution in that method. As illustrated in the drawings, the method comprises: extracting a two-dimensional (2D) luminance component (Y) image by excluding the chrominance components (U, V) from the image of one frame of the input video; sequentially accumulating the 2D Y images in three dimensions (3D) and extracting only evenly spaced frames from them to obtain a 3D Y image group; and performing image convolution on the 3D Y image group and deriving the desired detection scene using a 3*3*3 filter. The invention may be installed and operated, in the form of software or a platform, in the image analysis unit of a violence detection system that detects feature points of violence in an input video composed of frames supplied by a video camera or a video file.
In addition, to carry out the invention, when the input video is RGB-based, the image analysis unit may convert it from an RGB-based video to a YUV-based video.

The image analysis unit first extracts a two-dimensional (2D) luminance component (Y) image by excluding the chrominance components (U, V) from the image of one frame of the input video.
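
The luminance extraction can be sketched as below. The text does not specify the RGB-to-YUV conversion, so the BT.601 luma weights (0.299, 0.587, 0.114) used here are an assumption, as are the nested-list frame layout and the function name `luma_plane`.

```python
def luma_plane(rgb_frame):
    """Convert a frame of (R, G, B) tuples (0..255) to its Y plane,
    discarding the chrominance (U, V) information.
    Assumes BT.601 luma weights; the source names no standard."""
    return [
        [0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
        for row in rgb_frame
    ]

# Hypothetical 2x2 frame: white, black, pure red, pure blue.
frame = [
    [(255, 255, 255), (0, 0, 0)],
    [(255, 0, 0), (0, 0, 255)],
]
y_plane = luma_plane(frame)
print(y_plane)
```

Each pixel now carries one value instead of three, which is where the reduction in network input size comes from.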

(Deleted)

The image analysis unit then sequentially accumulates the extracted 2D Y images in three dimensions (3D) and extracts only evenly spaced frames from them to obtain a 3D Y image group. In the present invention, it is preferable to build the 3D space by stacking 30 2D frames.
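
The accumulate-then-sample step can be sketched as follows. The frames are stand-in labels rather than real images, and the sampling step of 3 is an assumption chosen so that 30 stacked frames yield the 10 layers mentioned elsewhere in the description; the text itself does not state the interval unambiguously.

```python
def sample_evenly(frames, step):
    """Keep every `step`-th frame of the stack, starting from the first."""
    return frames[::step]

# 30 placeholder Y frames standing in for the accumulated 3D stack.
stack = [f"Y{i:02d}" for i in range(30)]
group = sample_evenly(stack, step=3)  # step=3 is an assumed interval
print(len(group), group[0], group[-1])  # 10 Y00 Y27
```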

The image analysis unit performs image convolution on the obtained 3D Y image group and derives a violence detection scene using a 3*3*3 filter.
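
A naive single-channel 3D convolution with a 3*3*3 filter might look like the sketch below. It is written in plain Python for self-containment (a real system would use a deep learning library), it computes cross-correlation as deep learning frameworks conventionally do, and it assumes stride 1 with no padding. The averaging kernel and the tiny 10 x 5 x 5 volume are illustrative only.

```python
def conv3d_valid(volume, kernel):
    """Slide kernel[dt][dy][dx] over volume[t][y][x] (stride 1, no padding)."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):          # movement along the time axis
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out

# Hypothetical input: 10 frames of 5x5 pixels, all ones; averaging 3x3x3 kernel.
volume = [[[1.0] * 5 for _ in range(5)] for _ in range(10)]
kernel = [[[1.0 / 27] * 3 for _ in range(3)] for _ in range(3)]
out = conv3d_valid(volume, kernel)
print(len(out), len(out[0]), len(out[0][0]))  # 8 3 3
```

Because the kernel spans only three frames, the same weights apply to a stack of any length; only the number of temporal output positions changes.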

In the detection-scene derivation step, the image analysis unit is preferably set to extract, from the 3D Y-frame grayscale volume, only the frames of evenly spaced layers (for example, layers at predetermined ordinal positions), and is preferably set to perform image convolution using 10 layers.

Also in the detection-scene derivation step, the image analysis unit stores the feature points of violence encountered so far in the frame of a predetermined, evenly spaced layer among the 10 extracted layers, and re-learns them in the frame following the frame in which they were stored, so that the violent behavior is retained. This mitigates the loss of violence feature points during image convolution.

Also in the detection-scene derivation step, the image analysis unit performs a first convolution operation on the initial 224*224 image using a 3*3*3 filter; converts the resulting 224*224 image into a 112*112 image by a first pooling; performs a second convolution operation on the 112*112 image as a first re-learning; and then, on the result of the second convolution operation, performs a third convolution operation identical to the first, a second pooling to a 56*56 image, and a fourth convolution operation. In the first and second convolution operations, and in the third and fourth convolution operations, the feature points of violence are stored in predetermined, evenly spaced layers and re-learned in the frame following each frame in which they were stored; this re-learning method retains the violent behavior.
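
The spatial sizes quoted above (224*224, then 112*112, then 56*56) are what a 2x2 pooling with stride 2 produces. The sketch below assumes max pooling over the spatial axes only, since the text says "pooling" without naming the type or the window; the input values are arbitrary filler.

```python
def max_pool_2x2(plane):
    """2x2 max pooling with stride 2 over a 2D plane (nested lists)."""
    h, w = len(plane), len(plane[0])
    return [
        [
            max(plane[y][x], plane[y][x + 1], plane[y + 1][x], plane[y + 1][x + 1])
            for x in range(0, w - 1, 2)
        ]
        for y in range(0, h - 1, 2)
    ]

# Arbitrary 224x224 plane standing in for one feature map.
plane = [[(y * 224 + x) % 7 for x in range(224)] for y in range(224)]
pooled_once = max_pool_2x2(plane)
pooled_twice = max_pool_2x2(pooled_once)
print(len(pooled_once), len(pooled_once[0]))    # 112 112
print(len(pooled_twice), len(pooled_twice[0]))  # 56 56
```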

Also in the detection-scene derivation step, the image analysis unit detects violence in the video using eight 3x3x3 kernels and two 2x2x2 kernels in three-dimensional (3D) space. This permits analysis regardless of the length of the analyzed frame sequence, while moving the filter along the time axis allows learning over more consecutive frames than existing frameworks, which enables analysis of continuous actions and improves training time and detection accuracy.

The operation and effects of the violence detection frameworking method using deep-learning-based spatiotemporal analysis of shading images according to the present invention, configured as above, are as follows.

The violence detection frameworking method of the present invention is installed in the image analysis unit of a violence detection system. The image analysis unit divides the input video arriving in real time from a video camera or a video file into individual frame images, and extracts a 2D Y-frame grayscale image from each separated frame image by excluding red (R), green (G), and blue (B).

Next, a plurality (preferably 30 frames) of the extracted 2D Y-frame grayscale images are sequentially accumulated and converted into a 3D Y-frame grayscale volume.

However, because the Y-frame grayscale images in the 3D volume do not arrive uniformly, depending on the video input device and the transmission environment, the present invention extracts from the 3D Y-frame grayscale volume only the frames of evenly spaced layers (for example, layers at predetermined ordinal positions, preferably 5 frames), accumulates only the extracted frames again, performs image convolution using 10 layers, and then derives the desired detection scene using a 3*3*3 filter suited to space-time.

Furthermore, when violent video is convolved over the 3D Y-frame grayscale volume built by sequentially accumulating the 30 2D Y-frame grayscale images, feature points encountered earlier are easily lost. To address this, the present invention stores the feature points of violence encountered so far in at least two of the 10 extracted layers and re-learns them in the layer following each storing layer, so that the violent behavior is retained. Reusing past feature points in this way mitigates the loss of violence feature points during image convolution.
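
One possible reading of "storing feature points in a layer and reusing them in the next layer" is an identity shortcut of the kind used in residual networks, which the description says are applied to the 3D convolution. The sketch below shows only that generic mechanism on a flat feature vector; it is an interpretation, not the patented procedure, and `halve` is a hypothetical stand-in for a convolution layer.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]

def residual_step(features, transform):
    """Apply `transform`, then add the stored input back (identity shortcut)."""
    stored = list(features)           # features kept from the earlier layer
    transformed = transform(features)
    return relu([t + s for t, s in zip(transformed, stored)])

def halve(values):
    """Hypothetical stand-in for a learned convolution layer."""
    return [0.5 * v for v in values]

features = [0.5, -1.0, 2.0]
out = residual_step(features, halve)
print(out)  # [0.75, 0.0, 3.0]
```

Because the earlier features are added back rather than replaced, information from the earlier layer survives even if the transform attenuates it.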

Here, the image convolution performs a first convolution operation on the initial 224*224 image and transforms it into a 112*112 image, then performs a second convolution operation on the 112*112 image and transforms it into a 56*56 image. In the first and second convolution operations, the feature points of violence are stored in, for example, layer 3 and layer 8, and the stored features are re-learned in the following layers (layer 4 and layer 9), respectively, so that the violent behavior is retained.

In addition, because the present invention detects violence in the video using eight 3x3x3 kernels and two 2x2x2 kernels in three-dimensional (3D) space, analysis is possible regardless of the length of the analyzed frame sequence. Moving the filter along the time axis also allows learning over more consecutive frames than existing frameworks, which enables analysis of continuous actions and improves training time and detection accuracy.

Table 1 below compares the accuracy of the violence detection framework according to the present invention with that of several existing violence detection frameworks.

Table 1. Datasets: Hockey Dataset, Movies Dataset, Violent-Flows Dataset. Accuracies are listed per method in the order given in the source; rows with fewer than three values do not cover every dataset.
1) MoSIFT+HIK: 90.9%, 89.5%
2) VIF: 82.9±0.14%, 81.3±0.21%
3) MoSIFT+KDE+Sparse coding: 94.3±1.68%
4) Gracia et al.: 82.4±0.4%, 97.8±0.4%
5) Substantial Derivative: 96.89±0.2%, 85.43±0.21%
6) Bilinski et al.: 93.4%, 99%, 96.4%
7) MoIWLD: 96.8±1.04%, 93.19±0.1%
8) ViF+OViF: 87.5±1.7%, 88±2.45%
9) Three streams+LSTM: (no values listed)
The present invention: 97.1±0.23%, 99%, 95.61±2.76%

According to the present invention described above, a Y-frame black-and-white shading image is extracted from the real-time input video by excluding red (R), green (G), and blue (B); an image optimized for a lightweight network and for the time-space domain is built from it and applied to the algorithm; and a re-learning method is used during image convolution so that feature points of violence are continuously stored in specific layers and re-learned. This addresses the tendency to lose passed feature points when convolving violent video and improves the ability to detect violence in the image. In addition, because a filter (3 x 3 x 3 kernel) smaller than that of existing frameworks (3 x 3 x F) is used, analysis is possible regardless of the length of the analysis frame.
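The preprocessing summarized above (luminance extraction, accumulation of frames, and selection of evenly spaced layers) can be sketched as follows. This is a minimal illustration under stated assumptions: the source does not give the formula for the Y component, so the common BT.601 luma weights are assumed here; frames are plain nested lists of (R, G, B) tuples; and the names `rgb_to_y`, `evenly_spaced_indices`, and `build_y_volume` are hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
RgbFrame = List[List[Pixel]]
YFrame = List[List[float]]


def rgb_to_y(frame: RgbFrame) -> YFrame:
    """Keep only luminance, using BT.601 weights (an assumption, not from the patent)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in frame]


def evenly_spaced_indices(total: int, count: int) -> List[int]:
    """Pick `count` indices at a constant step from `total` accumulated frames."""
    if count <= 0 or total < count:
        raise ValueError("need at least `count` frames")
    step = total // count
    return [i * step for i in range(count)]


def build_y_volume(frames: List[RgbFrame], count: int = 10) -> List[YFrame]:
    """Convert every frame to Y, then keep `count` evenly spaced layers."""
    y_frames = [rgb_to_y(f) for f in frames]
    return [y_frames[i] for i in evenly_spaced_indices(len(y_frames), count)]


# 30 tiny 2x2 white frames stand in for 30 accumulated video frames.
clip = [[[(255, 255, 255)] * 2 for _ in range(2)] for _ in range(30)]
volume = build_y_volume(clip)
print(len(volume), evenly_spaced_indices(30, 10))
```

With 30 accumulated frames and 10 layers, every third frame is kept, which is one simple reading of "evenly spaced layers"; other sampling schemes would fit the text equally well.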

Also according to the present invention, moving the filter along the time axis allows learning over more consecutive frames than existing frameworks, which enables analysis of continuous actions, and applying a residual network to the 3D convolution improves learning time and detection accuracy.
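The residual idea mentioned above can be shown in its simplest form: the output of a block is its input plus a learned transformation of that input, so earlier features are carried forward rather than overwritten. The sketch below is a generic illustration, not the patent's architecture; `transform` is a hypothetical stand-in for the 3D convolution layers, and the feature map is flattened to a list for brevity.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, transform: Callable[[Vector], Vector]) -> Vector:
    """Return x + transform(x); the identity path preserves the incoming features."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("transform must preserve the feature size")
    return [a + b for a, b in zip(x, fx)]


def halve(values: Vector) -> Vector:
    """Toy transformation standing in for convolution layers (assumption)."""
    return [0.5 * v for v in values]


features = [1.0, 2.0, 4.0]
print(residual_block(features, halve))  # [1.5, 3.0, 6.0]
```

Because the input is added back unchanged, features computed in an earlier layer remain available to later layers, which is consistent with the stated goal of not losing passed feature points.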

Although the present invention has been described above with reference to limited embodiments and drawings, it is not limited to those embodiments, and a person of ordinary skill in the art to which the invention pertains can make various modifications and variations from this description. Accordingly, the spirit of the present invention should be understood only from the claims set forth below, and all equal or equivalent modifications thereof fall within the scope of the spirit of the present invention.

11: Image per separated frame
12: 2D-based Y-frame black-and-white image
13: 3D-environment Y-frame black-and-white image

Claims

In a violence detection system having an image analysis unit that detects violence in an image by detecting feature points of violence in an input image composed of video frames provided from a video camera or a video file, a framework method for violence detection performed by the image analysis unit, the method comprising:
obtaining a two-dimensional (2D) black-and-white Y image of the luminance component (Y) by excluding the color-difference components (U, V) from the image of one frame included in the input image;
sequentially accumulating the 2D-based Y images in three dimensions (3D), and extracting and accumulating only the frames of evenly spaced layers among 30 frames to obtain a 3D-based Y image group; and
performing image convolution on the 3D-based Y image group using a 3*3*3 filter and deriving a desired detection scene,
wherein, in the step of deriving the detection scene, the image analysis unit
performs a first convolution operation on the initial 224*224 image using a 3*3*3 filter, transforms it by first pooling into a 112*112 image, performs first re-learning by a second convolution operation on the 112*112 image obtained by the first pooling, performs a third convolution operation, transforms the result by second pooling into a 56*56 image, and performs a fourth convolution operation, and, in the first, second, third, and fourth convolution operations, stores the feature points of violence in predetermined evenly spaced layers and re-learns them in the frame following the frame in which the feature of violence was stored, so that the violent act is remembered,
extracts ten layers from the Y-frame black-and-white image in the 3D environment and accumulates them again to perform image convolution, using a re-learning method in which the passed feature points of violence are stored in at least two of the ten extracted layers and re-learned in the layer following each storing layer, so that the violent act is remembered, and
detects violence in the image using eight 3x3x3 kernels and two 2x2x2 kernels in 3D space,
the method being a framework method for violence detection using spatiotemporal characteristic analysis of deep learning based shading images.

(Deleted)