CN111259116A

CN111259116A - Sensitive file detection method based on convolutional neural network

Info

Publication number: CN111259116A
Application number: CN202010048855.7A
Authority: CN
Inventors: 孔令武; 田峥; 黎曦; 关勇
Original assignee: Beijing Luoan Technology Co Ltd
Current assignee: Beijing Luoan Technology Co Ltd
Priority date: 2020-01-16
Filing date: 2020-01-16
Publication date: 2020-06-09

Abstract

The invention discloses a sensitive file detection method based on a convolutional neural network, comprising the following steps: 1) training word vector features on a data set; 2) combining the trained word vectors to obtain vector features of the document; and 3) using the vector features as input features for deep learning based on the convolutional neural network, and treating them as operating parameters of the method that participate in the iterative training process. This technical scheme filters sensitive information in state secret files, such as national power development strategies and plans for major national power projects, and in files involving the enterprise's own business secrets, such as enterprise operating data. It overcomes the defects of existing detection methods based on a sensitive word lexicon, namely low detection efficiency and high miss and false-alarm rates.

Description

Sensitive file detection method based on convolutional neural network

Technical Field

The invention relates to the technical field of information detection and filtering, in particular to a sensitive file detection method based on a convolutional neural network.

Background

The Cybersecurity Law of the People's Republic of China, which formally took effect on June 1, 2017, places particular emphasis on monitoring sensitive information in critical information infrastructure. The power industry bears on the national economy, people's livelihood and national energy security; a leak would certainly have a serious negative effect on the industry's public image and economic interests. An important prerequisite for building an effective network security monitoring system is to identify sensitive data accurately and then form an effective monitoring and response scheme based on it. Sensitive data arising in the operation of a power enterprise falls mainly into two categories: 1) state secret documents, such as national power development strategies and plans for major national power projects; and 2) files involving the enterprise's own business secrets, such as enterprise operating data. The present method focuses on how to detect business secret files effectively in enterprise production and operation. At present, most sensitive file detection methods commonly used by power enterprises rely on a sensitive word lexicon. Words such as 'secrecy', 'scheme', 'plan', 'internal data' and 'basis' are examples of keywords in such a lexicon; the frequency with which these keywords appear in a file is counted, using a word segmentation algorithm or similar, to judge whether the file is sensitive. This approach is fast, but its false-alarm and miss rates are high. For example, a novel may contain phrases such as 'a secret that cannot be spoken', 'plan the next action' and 'plan'. When the frequencies of 'secret' and 'plan' in the novel are counted against the feature lexicon, the novel is wrongly judged to be a sensitive office file. This increases the cost of rechecking sensitive files, and manual rechecking can itself easily lead to unauthorized spread of sensitive files.
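
By way of illustration only, the lexicon-based approach described above may be sketched as follows. The word list, the threshold and the English sample text are hypothetical stand-ins chosen for the example; a real system would run a word segmentation algorithm over Chinese text and use a much larger lexicon.

```python
import re
from collections import Counter

# Hypothetical lexicon and threshold, chosen only for this illustration.
SENSITIVE_WORDS = {"secret", "scheme", "plan", "internal"}
THRESHOLD = 3


def keyword_hits(text):
    """Count how many tokens of the text belong to the lexicon."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return sum(counts[word] for word in SENSITIVE_WORDS)


def is_sensitive_by_lexicon(text, threshold=THRESHOLD):
    """Flag a file as sensitive when lexicon hits reach the threshold."""
    return keyword_hits(text) >= threshold


# A passage from a novel, not an office file.
novel = ("She kept a secret that could not be spoken, "
         "and began to plan her next action. The plan was simple.")
flagged = is_sensitive_by_lexicon(novel)
```

Under these assumptions the novel passage is flagged, because the counter sees only word frequencies and not the context in which the words occur. This is the false-alarm behaviour that motivates the method.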

Existing research on using word vectors and convolutional neural networks to detect enterprise business secret files is limited or nonexistent.

Disclosure of Invention

In view of at least one of the above problems, the invention provides a sensitive information filtering method based on a convolutional neural network, which aims to solve the low detection performance of sensitive file detection based on a feature lexicon. As is well known, convolutional neural networks originated in image recognition, where they are widely applied. The method of the invention therefore first addresses how to convert document data into image-like data; the solution is to represent the semantic content of the document as a two-dimensional feature matrix similar to image data.

To achieve the above object, the invention provides a sensitive file detection method based on a convolutional neural network, comprising: representing the semantic content of the document as a two-dimensional feature matrix similar to image data, and training word vector features on a data set; combining the trained word vectors to obtain vector features of the document; and using the vector features as input features for deep learning based on the convolutional neural network, and treating the vector features as operating parameters of the method that participate in the iterative training process.
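
As a minimal sketch of how a document might be turned into an image-like two-dimensional matrix, the example below stacks one word vector per row and pads or truncates to a fixed number of rows. The function names, the fixed length `max_len`, the zero-vector handling of unknown words and the random vectors (which merely stand in for trained word vectors) are all assumptions made for illustration, not details stated in this disclosure.

```python
import random


def build_embeddings(vocab, dim, seed=0):
    """Stand-in for trained word vectors: one random dim-length vector per word."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for word in vocab}


def document_matrix(tokens, embeddings, max_len, dim):
    """Stack word vectors row by row, truncating or zero-padding to max_len rows."""
    zero = [0.0] * dim
    # Unknown words map to a zero row (an assumption of this sketch).
    rows = [list(embeddings.get(tok, zero)) for tok in tokens[:max_len]]
    pad = max_len - len(rows)
    rows.extend([0.0] * dim for _ in range(pad))
    return rows


vocab = ["annual", "revenue", "forecast", "internal"]
embeddings = build_embeddings(vocab, dim=4)
matrix = document_matrix(["internal", "revenue", "unknown"], embeddings,
                         max_len=5, dim=4)
```

The result is a `max_len` by `dim` matrix regardless of document length, which is what allows a convolutional network to treat it in the same way as a fixed-size image.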

In the above technical solution, preferably, a word vector matrix is formed for the deep learning based on the convolutional neural network; the two-dimensional data matrix formed by the extracted word vector features is input to the convolutional neural network; and the output classifies the file as an ordinary file or a sensitive file.
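
The following is a hedged sketch of a forward pass that maps such a two-dimensional matrix to two output classes. The architecture shown (kernels spanning the full vector width, ReLU activation, max pooling over positions, one fully connected layer and a softmax) is a common text-CNN layout assumed here for illustration; the disclosure does not fix kernel sizes, activations or the output function, and all weights below are hand-picked rather than trained.

```python
import math


def conv_feature_map(matrix, kernel, bias):
    """Slide a kernel of shape (h, dim) down the rows and apply ReLU.

    Assumes the matrix has at least h rows.
    """
    h = len(kernel)
    responses = []
    for i in range(len(matrix) - h + 1):
        s = bias
        for r in range(h):
            s += sum(x * w for x, w in zip(matrix[i + r], kernel[r]))
        responses.append(max(0.0, s))
    return responses


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cnn_forward(matrix, kernels, biases, fc_weights, fc_bias):
    """Return [p_ordinary, p_sensitive] for one document matrix."""
    # Max pooling keeps the strongest response of each kernel.
    pooled = [max(conv_feature_map(matrix, k, b)) for k, b in zip(kernels, biases)]
    scores = [sum(p * w for p, w in zip(pooled, row)) + b
              for row, b in zip(fc_weights, fc_bias)]
    return softmax(scores)


# Toy 3x2 document matrix and one 2x2 kernel (illustrative values only).
matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kernels = [[[1.0, 1.0], [1.0, 1.0]]]
biases = [0.0]
fc_weights = [[1.0], [-1.0]]  # row 0: ordinary, row 1: sensitive
fc_bias = [0.0, 0.0]
probs = cnn_forward(matrix, kernels, biases, fc_weights, fc_bias)
```

In a trained system the kernels and fully connected weights would be learned from labelled ordinary and sensitive files; the file would then be assigned to whichever class has the larger probability.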

In the above technical solution, preferably, the actual processing specifically comprises: word vector training, in which the word vector training set module mainly builds a word vector model from a corpus and converts documents written in human language into a form that a machine can recognize, where the more complete the corpus, the more accurate the trained word vectors; sensitive file feature training, which takes the word vectors as input to the convolutional neural network and, after the operations of a convolutional layer, a pooling layer, a max-pooling layer and a fully connected layer, learns to classify and identify files using cross-validation; and detection, in which the detection module extracts the word vector features of the file to be detected and then identifies sensitive files through the computation of the convolutional neural network.
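
The disclosure mentions cross-validation without naming a scheme. Assuming k-fold cross-validation, which is one common choice, the splitting of labelled files into training and validation subsets could be sketched as follows. The function name and the fold count are hypothetical, and in practice the samples would usually be shuffled before splitting.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; fold sizes differ by at most one."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


# Ten labelled files split into three folds.
folds = list(k_fold_indices(10, 3))
```

Each file serves as validation data exactly once, so the classification accuracy averaged over the folds gives an estimate that does not depend on a single train/validation split.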

Drawings

FIG. 1 is a schematic block diagram of a flow of a sensitive information filtering method based on a convolutional neural network according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.

The invention is described in further detail below with reference to the attached drawing figures:

A prototype system is built on the sensitive file detection method based on a convolutional neural network, using word vectors and the convolutional neural network to solve the problem of detecting enterprise business secret files. The system is roughly divided into three modules: a word vector training set module, a sensitive file training module and a detection module. The overall architecture of the system is shown in FIG. 1. Specifically, the semantic content of a document is represented as a two-dimensional feature matrix similar to image data, and word vector features are trained on a data set; the trained word vectors are combined to obtain vector features of the document; and the vector features are used as input features for deep learning based on the convolutional neural network and are treated as operating parameters of the method that participate in the iterative training process.
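
One possible reading of the statement that the vector features take part in the iterative training process as parameters is that the word vectors themselves are updated by gradient descent together with the classifier weights. The sketch below illustrates only that reading. For brevity, a linear classifier over the averaged word vectors stands in for the convolutional network, and all names, values and the learning rate are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_step(tokens, label, embeddings, weights, lr=0.1):
    """One gradient step on log loss.

    Updates the classifier weights and the embeddings of the tokens used,
    both in place, and returns the prediction made before the update.
    """
    dim = len(weights)
    n = len(tokens)
    mean = [sum(embeddings[t][j] for t in tokens) / n for j in range(dim)]
    pred = sigmoid(sum(w * m for w, m in zip(weights, mean)))
    err = pred - label  # derivative of log loss w.r.t. the pre-sigmoid score
    old_weights = list(weights)
    for j in range(dim):
        weights[j] -= lr * err * mean[j]
    for t in tokens:  # a token that occurs twice is updated twice
        for j in range(dim):
            embeddings[t][j] -= lr * err * old_weights[j] / n
    return pred


embeddings = {"internal": [0.1, -0.2], "budget": [0.3, 0.4]}
weights = [0.5, -0.5]
before = list(embeddings["internal"])

# Two steps on the same sensitive (label 1) document.
p1 = train_step(["internal", "budget"], 1, embeddings, weights)
p2 = train_step(["internal", "budget"], 1, embeddings, weights)
```

After the first step both the weights and the word vectors have moved, and the second prediction is closer to the label than the first. In this reading the word vectors are adapted to the sensitive-file task rather than kept fixed after pre-training.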

A word vector matrix is formed for the deep learning based on the convolutional neural network; the two-dimensional data matrix formed by the extracted word vector features is input to the convolutional neural network; and the output classifies the file as an ordinary file or a sensitive file.

The actual processing specifically comprises: word vector training, in which the word vector training set module mainly builds a word vector model from a corpus and converts documents written in human language into a form that a machine can recognize, where the more complete the corpus, the more accurate the trained word vectors; sensitive file feature training, which takes the word vectors as input to the convolutional neural network and, after the operations of a convolutional layer, a pooling layer, a max-pooling layer and a fully connected layer, learns to classify and identify files using cross-validation; and detection, in which the detection module extracts the word vector features of the file to be detected and then identifies sensitive files through the computation of the convolutional neural network.

Claims

1. A sensitive file detection method based on a convolutional neural network is characterized by comprising the following steps:

representing the semantic content of the document as a two-dimensional feature matrix similar to image data, and training word vector features on a data set;

combining the trained word vectors to obtain vector characteristics of the document;

and using the vector features as input features for deep learning based on the convolutional neural network, and treating the vector features as operating parameters of the method that participate in an iterative training process.

2. The sensitive file detection method based on the convolutional neural network as claimed in claim 1, wherein a word vector matrix is formed for the deep learning based on the convolutional neural network,

and a two-dimensional data matrix formed by the extracted word vector features is input into the convolutional neural network, which outputs a classification of the file as an ordinary file or a sensitive file.

3. The sensitive file detection method based on the convolutional neural network as claimed in claim 1, wherein the actual processing procedure specifically comprises:

word vector training, in which the word vector training set module mainly builds a word vector model from a corpus and converts documents written in human language into a form that a machine can recognize, wherein the more complete the corpus, the more accurate the trained word vectors;

sensitive file feature training, which takes the word vectors as input to the convolutional neural network and, after the operations of a convolutional layer, a pooling layer, a max-pooling layer and a fully connected layer, learns to classify and identify files using cross-validation;

and detection, in which the detection module extracts the word vector features of the file to be detected and then identifies sensitive files through the computation of the convolutional neural network.