CN111259116A - Sensitive file detection method based on convolutional neural network - Google Patents

Sensitive file detection method based on convolutional neural network

Info

Publication number
CN111259116A
Authority
CN
China
Prior art keywords
neural network
convolutional neural
sensitive
vector
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010048855.7A
Other languages
Chinese (zh)
Inventor
孔令武
田峥
黎曦
关勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Luoan Technology Co Ltd
Original Assignee
Beijing Luoan Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Luoan Technology Co Ltd
Priority to CN202010048855.7A
Publication of CN111259116A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a sensitive file detection method based on a convolutional neural network, comprising the following steps: 1) training word vector features on a data set; 2) combining the trained word vectors to obtain the vector features of a document; 3) using these vector features as the input features for deep learning based on the convolutional neural network, where they also take part in the iterative training process as operating parameters of the method. With this technical scheme, sensitive information is filtered from state secret files (such as national power development strategies and plans for major national power projects) and from files involving an enterprise's own business secrets (such as enterprise operating data). The scheme overcomes the defects of existing detection methods based on a sensitive word bank, namely low detection efficiency and high missed-report and false-report rates.

Description

Sensitive file detection method based on convolutional neural network
Technical Field
The invention relates to the technical field of information detection and filtering, in particular to a sensitive file detection method based on a convolutional neural network.
Background
The Cybersecurity Law of the People's Republic of China, which formally took effect on June 1, 2017, places particular emphasis on monitoring sensitive information in critical information infrastructure. The power industry bears on the national economy, people's livelihood, and national energy security; a leak of secrets would seriously damage the industry's public image and economic interests. Building an effective network security monitoring system depends first on accurately identifying sensitive data, and then on forming an effective monitoring and response scheme around that data. Sensitive data in the operation of a power enterprise falls mainly into two classes: 1) state secret files, such as national power development strategies and plans for major national power projects; 2) files involving the enterprise's own business secrets, such as enterprise operating data. This method focuses on how to effectively detect business secret files produced in the course of enterprise operations. At present, most sensitive file detection methods used by power enterprises rely on a sensitive word bank. For example, "secrecy", "scheme", "plan", "internal data", and "base" are some of the keywords in such a word bank; a word segmentation algorithm or similar is used to count how often these keywords appear in a file, and the count determines whether the file is judged sensitive. This approach is fast, but its false alarm rate and missed alarm rate are both high.
For example, a novel may contain phrases such as "a secret that cannot be spoken" and "plan the next action". When the frequencies of "secret" and "plan" are counted against the feature word bank, the novel is wrongly judged to be a sensitive office file. Such errors increase the cost of rechecking flagged files, and manual rechecking itself can easily lead to the unauthorized spread of sensitive files.
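The failure mode described above can be illustrated with a minimal sketch of a word-bank detector. The word bank, the threshold, and the two sample texts are assumptions made for illustration; English tokenization with a regular expression stands in for the Chinese word segmentation step the text refers to.

```python
import re
from collections import Counter

# Hypothetical sensitive word bank; the entries echo the examples in the text.
SENSITIVE_WORDS = {"secret", "scheme", "plan", "internal"}


def keyword_hits(text: str) -> Counter:
    """Count how often each word-bank entry occurs in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t in SENSITIVE_WORDS)


def is_sensitive(text: str, threshold: int = 3) -> bool:
    """Flag a file when the total keyword count reaches the (assumed) threshold."""
    return sum(keyword_hits(text).values()) >= threshold


# A harmless novel excerpt that happens to use word-bank entries.
novel = "It was a secret that could not be spoken. They plan the next action, and the plan holds."
# Operating data that contains no word-bank entry at all.
report = "Quarterly grid load figures for the eastern region."

print(is_sensitive(novel), is_sensitive(report))
```

Under these assumptions the novel is flagged (a false alarm) while the operating-data line passes unflagged (a missed alarm), because the detector sees only keyword counts and not the meaning of the text.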
Existing research on using word vectors and convolutional neural networks to detect files containing enterprise business secrets is limited or nonexistent.
Disclosure of Invention
To address at least one of the above problems, the invention provides a sensitive information filtering method based on a convolutional neural network, which aims to solve the low detection performance of sensitive file detection based on a feature word bank. Convolutional neural networks originated in image recognition and are now widely applied. The method therefore first solves the problem of converting document data into image-like data: the semantic content of a document is represented as a two-dimensional feature matrix similar to image data.
To achieve the above object, the invention provides a sensitive file detection method based on a convolutional neural network, comprising: representing the semantic content of a document as a two-dimensional feature matrix similar to image data, and training word vector features on a data set; combining the trained word vectors to obtain the vector features of the document; and using the vector features as the input features for deep learning based on the convolutional neural network, where they also take part in the iterative training process as operating parameters of the method.
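One way to picture the image-like representation is to stack one word vector per row into a fixed-size matrix. The sketch below assumes a tiny hand-written vector table, a vector size of 4, and a fixed length of 6 words with zero padding; none of these values come from the patent, and real word vectors would be trained on a corpus.

```python
from typing import Dict, List

EMBED_DIM = 4  # assumed word vector size (matrix width)
MAX_WORDS = 6  # assumed fixed document length (matrix height)

# Hypothetical pre-trained word vectors, written by hand for illustration.
WORD_VECTORS: Dict[str, List[float]] = {
    "internal": [0.9, 0.1, 0.0, 0.3],
    "budget": [0.8, 0.2, 0.1, 0.4],
    "plan": [0.5, 0.5, 0.2, 0.1],
}


def document_matrix(tokens: List[str]) -> List[List[float]]:
    """Stack word vectors row by row into a MAX_WORDS x EMBED_DIM matrix.

    Unknown words map to a zero row, long documents are truncated,
    and short documents are padded with zero rows.
    """
    zero = [0.0] * EMBED_DIM
    rows = [list(WORD_VECTORS.get(t, zero)) for t in tokens[:MAX_WORDS]]
    rows += [list(zero) for _ in range(MAX_WORDS - len(rows))]
    return rows


matrix = document_matrix(["internal", "budget", "plan", "draft"])
for row in matrix:
    print(row)
```

Every document then has the same shape regardless of its length, which is what lets a convolutional network treat it like a small single-channel image.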
In the above technical solution, preferably, the deep learning based on the convolutional neural network forms a word vector matrix; the two-dimensional data matrix formed from the extracted word vector features is input to the convolutional neural network, which outputs a classification of the file as either an ordinary file or a sensitive file.
In the above technical solution, preferably, the actual processing comprises three parts. Word vector training: a word vector model is built from a corpus, converting documents written in human language into a form a machine can recognize; the more complete the corpus, the more accurate the trained word vectors. Sensitive file feature training: the word vectors are used as the input of the convolutional neural network, and after the operations of the convolutional layer, pooling layer, max-pooling layer, and fully connected layer, file classification is learned and evaluated through cross validation. Detection: the detection module extracts the word vector features of the file to be detected and then identifies sensitive files through the computation of the convolutional neural network.
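The layer sequence named above can be sketched as a forward pass in plain Python. This is an illustration under stated assumptions, not the patented network: the kernels and fully connected weights are hand-picked rather than trained, a single max-pooling step stands in for the pooling stages, and the output is read as a sigmoid score for the "sensitive" class.

```python
import math
from typing import List

Matrix = List[List[float]]


def conv_over_words(doc: Matrix, kernel: Matrix, bias: float = 0.0) -> List[float]:
    """Slide a kernel covering len(kernel) consecutive words down the document
    matrix and apply ReLU, giving one activation per window position."""
    height = len(kernel)
    feature_map = []
    for start in range(len(doc) - height + 1):
        total = bias
        for i in range(height):
            for x, w in zip(doc[start + i], kernel[i]):
                total += x * w
        feature_map.append(max(0.0, total))
    return feature_map


def max_pool(feature_map: List[float]) -> float:
    """Keep only the strongest activation of one feature map."""
    return max(feature_map)


def dense_sigmoid(features: List[float], weights: List[float], bias: float) -> float:
    """Fully connected layer with one output unit and a sigmoid activation."""
    z = bias + sum(f * w for f, w in zip(features, weights))
    return 1.0 / (1.0 + math.exp(-z))


def predict(doc: Matrix, kernels: List[Matrix], fc_weights: List[float], fc_bias: float) -> float:
    """Convolution, max pooling, then the fully connected layer."""
    pooled = [max_pool(conv_over_words(doc, k)) for k in kernels]
    return dense_sigmoid(pooled, fc_weights, fc_bias)


# Toy 4-word document with 2-dimensional word vectors (illustrative values).
doc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# Two hand-picked kernels, each spanning two consecutive words.
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [0.0, -1.0]],
]
score = predict(doc, kernels, fc_weights=[1.0, 1.0], fc_bias=-2.0)
print(score)
```

In training, the kernel and fully connected weights would be adjusted iteratively from labeled files; here they are fixed only so the data flow through the layers is visible.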
Drawings
FIG. 1 is a schematic block diagram of a flow of a sensitive information filtering method based on a convolutional neural network according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The invention is described in further detail below with reference to the attached drawing figures:
Using word vectors and a convolutional neural network to address the detection of enterprise business secret files, a prototype system is built on the sensitive file detection method. It is divided into roughly three modules: a word vector training module, a sensitive file training module, and a detection module. The overall architecture of the system is shown in FIG. 1. Specifically, the semantic content of a document is represented as a two-dimensional feature matrix similar to image data, and word vector features are trained on a data set; the trained word vectors are combined to obtain the vector features of the document; and the vector features are used as the input features for deep learning based on the convolutional neural network, where they also take part in the iterative training process as operating parameters of the method.
The deep learning based on the convolutional neural network forms a word vector matrix; the two-dimensional data matrix formed from the extracted word vector features is input to the convolutional neural network, which outputs a classification of the file as either an ordinary file or a sensitive file.
The actual processing comprises three parts. Word vector training: a word vector model is built from a corpus, converting documents written in human language into a form a machine can recognize; the more complete the corpus, the more accurate the trained word vectors. Sensitive file feature training: the word vectors are used as the input of the convolutional neural network, and after the operations of the convolutional layer, pooling layer, max-pooling layer, and fully connected layer, file classification is learned and evaluated through cross validation. Detection: the detection module extracts the word vector features of the file to be detected and then identifies sensitive files through the computation of the convolutional neural network.
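The description names cross validation but does not say which scheme is used. As one common possibility, a k-fold split can be sketched as follows; the function name, the choice of contiguous folds, and the sample counts are assumptions for illustration.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of k contiguous folds.

    Fold sizes differ by at most one sample. Shuffling the labeled files
    beforehand is left to the caller.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


# Seven labeled files split into three folds (illustrative numbers).
folds = list(k_fold_indices(n_samples=7, k=3))
for train, validation in folds:
    print("train:", train, "validate:", validation)
```

Each labeled file is used for validation exactly once, so the classifier's accuracy on ordinary versus sensitive files can be averaged over the folds rather than judged on a single split.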

Claims (3)

1. A sensitive file detection method based on a convolutional neural network is characterized by comprising the following steps:
representing the semantic content of a document as a two-dimensional feature matrix similar to image data, and training word vector features on a data set;
combining the trained word vectors to obtain the vector features of the document;
and using the vector features as the input features for deep learning based on the convolutional neural network, where they also take part in the iterative training process as operating parameters of the method.
2. The sensitive file detection method based on the convolutional neural network as claimed in claim 1, wherein the deep learning based on the convolutional neural network forms a word vector matrix,
and the two-dimensional data matrix formed from the extracted word vector features is input to the convolutional neural network, which outputs a classification of the file as either an ordinary file or a sensitive file.
3. The sensitive file detection method based on the convolutional neural network as claimed in claim 1, wherein the actual processing specifically comprises:
word vector training, in which a word vector model is built from a corpus, converting documents written in human language into a form a machine can recognize, wherein the more complete the corpus, the more accurate the trained word vectors;
sensitive file feature training, in which the word vectors are used as the input of the convolutional neural network, and after the operations of the convolutional layer, pooling layer, max-pooling layer, and fully connected layer, file classification is learned and evaluated through cross validation;
and detection, in which the detection module extracts the word vector features of the file to be detected and then identifies sensitive files through the computation of the convolutional neural network.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010048855.7A CN111259116A (en) 2020-01-16 2020-01-16 Sensitive file detection method based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010048855.7A CN111259116A (en) 2020-01-16 2020-01-16 Sensitive file detection method based on convolutional neural network

Publications (1)

Publication Number Publication Date
CN111259116A (en) 2020-06-09

Family

ID=70952175

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010048855.7A Pending CN111259116A (en) 2020-01-16 2020-01-16 Sensitive file detection method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN111259116A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107391483A (en) * 2017-07-13 2017-11-24 武汉大学 Sentiment classification method for product review data based on convolutional neural networks
CN107863147A (en) * 2017-10-24 2018-03-30 清华大学 Medical diagnosis method based on deep convolutional neural networks
CN107835496A (en) * 2017-11-24 2018-03-23 北京奇虎科技有限公司 Spam message identification method, device and server
CN109871535A (en) * 2019-01-16 2019-06-11 四川大学 French named entity recognition method based on deep neural network
CN109783614A (en) * 2019-01-25 2019-05-21 北京信息科技大学 Differential privacy leakage detection method and system for social network text to be published

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
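Lin Xuefeng (林学峰): "Sensitive file detection method based on convolutional neural network" (基于卷积神经网络的敏感文件检测方法), Computer and Modernization (《计算机与现代化》) *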

Similar Documents

Publication Publication Date Title
Zhao et al. idlg: Improved deep leakage from gradients
CN112163416B (en) Event joint extraction method for merging syntactic and entity relation graph convolution network
CN107070943B (en) Industrial internet intrusion detection method based on flow characteristic diagram and perceptual hash
CN107992764B (en) Sensitive webpage identification and detection method and device
CN112148997B (en) Training method and device for multi-modal countermeasure model for disaster event detection
CN109918647A (en) A kind of security fields name entity recognition method and neural network model
CN113420294A (en) Malicious code detection method based on multi-scale convolutional neural network
CN105574489A (en) Layered stack based violent group behavior detection method
CN113723330A (en) Method and system for understanding chart document information
CN111047428B (en) Bank high-risk fraud customer identification method based on small amount of fraud samples
CN112328792A (en) Optimization method for recognizing credit events based on DBSCAN clustering algorithm
Ni Face recognition based on deep learning under the background of big data
CN107992508B (en) Chinese mail signature extraction method and system based on machine learning
CN113064967B (en) Complaint reporting credibility analysis method based on deep migration network
CN111259116A (en) Sensitive file detection method based on convolutional neural network
CN117708561A (en) Information processing method, information processing device, electronic equipment and storage medium
CN112966296A (en) Sensitive information filtering method and system based on rule configuration and machine learning
CN118012776A (en) Software defect prediction method and system based on generation of countermeasure and pretraining model
CN114936615B (en) Small sample log information anomaly detection method based on characterization consistency correction
CN112270548B (en) Credit card fraud detection method based on deep learning
CN112116577B (en) Deep learning-based tamper portrait video detection method and system
CN114005004B (en) Fraud website identification method and system based on picture instance level characteristics
CN117216264A (en) Machine tool equipment fault analysis method and system based on BERT algorithm
CN112733144B (en) Intelligent malicious program detection method based on deep learning technology
CN112860648A (en) Intelligent analysis method based on log platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200609