CN111259116A - Sensitive file detection method based on convolutional neural network - Google Patents
Info
- Publication number
- CN111259116A
- Authority
- CN
- China
- Prior art keywords
- neural network
- convolutional neural
- sensitive
- vector
- file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a sensitive file detection method based on a convolutional neural network, comprising the following steps: 1) training word vector features on a data set; 2) combining the trained word vectors to obtain the vector features of a document; and 3) using the vectors as input features for deep learning based on the convolutional neural network, and also treating them as parameters of the method that participate in the iterative training process. The technical scheme filters sensitive information in files such as state secret documents (for example, national power development strategies and plans for major national power projects) and business secret files of the enterprise itself (for example, enterprise operation data). It overcomes the defects of existing detection methods based on a sensitive word bank, namely low detection efficiency and high missed-detection and false-alarm rates.
Description
Technical Field
The invention relates to the technical field of information detection and filtering, in particular to a sensitive file detection method based on a convolutional neural network.
Background
The Cybersecurity Law of the People's Republic of China, which took effect on June 1, 2017, places particular emphasis on monitoring sensitive information in critical information infrastructure. The power industry bears on the national economy, people's livelihood, and national energy security; a leak of confidential information would seriously damage the industry's public image and economic interests. An effective network security monitoring system rests on accurately identifying sensitive data and then building an effective monitoring and response scheme around that data. Sensitive data arising in the operation of a power enterprise falls mainly into two classes: 1) state secret documents, such as national power development strategies and plans for major national power projects; and 2) business secret files of the enterprise itself, such as enterprise operation data. The present method focuses on how to effectively detect business secret files produced in enterprise operations. At present, most sensitive file detection methods used by power enterprises rely on a sensitive word bank. For example, "secrecy", "schemes", "plans", "internal data", and "bases" are some of the keywords in such a word bank; a word segmentation algorithm or similar technique counts how often the keywords appear in a file, and the count determines whether the file is judged sensitive. This approach is fast, but its false-alarm and missed-detection rates are high.
For example, a novel may contain phrases such as "a secret that cannot be spoken" and "plan the next action". When the frequencies of words such as "secret" and "plan" in the novel are counted against the feature word bank, the novel is wrongly judged to be a sensitive office file. Such errors raise the cost of rechecking flagged files, and manual rechecking itself can easily lead to the unauthorized spread of sensitive files.
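The false alarm described above can be reproduced with a minimal sketch of a word-bank detector. The lexicon, the threshold of 3, and the function names are hypothetical choices made for illustration; the patent does not specify them, and a real system would use a Chinese word segmenter rather than a regular expression.

```python
import re
from collections import Counter

# Hypothetical sensitive-word lexicon; entries echo the keywords named in the text.
SENSITIVE_WORDS = {"secret", "scheme", "plan", "internal"}

def keyword_hits(text):
    """Count how often each lexicon word occurs in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t in SENSITIVE_WORDS)

def is_sensitive_by_keywords(text, threshold=3):
    """Flag a file when the total number of lexicon hits reaches the threshold."""
    return sum(keyword_hits(text).values()) >= threshold

# A passage from an imagined novel, not an office document.
novel = ("It was a secret that could not be spoken. "
         "They had to plan the next action, and the plan had to stay hidden.")
print(keyword_hits(novel))
print(is_sensitive_by_keywords(novel))
```

The passage is flagged because the detector sees only word counts, not the context in which the words appear.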
Existing research on using word vectors and convolutional neural networks to detect enterprise business secret files is limited or nonexistent.
Disclosure of Invention
To address at least one of the above problems, the invention provides a sensitive information filtering method based on a convolutional neural network, which aims to solve the low detection performance of sensitive file detection based on a feature word bank. Convolutional neural networks originated in image recognition, where they are widely applied. The method therefore first addresses how to convert document data into image-like data: the semantic content of a document is represented as a two-dimensional feature matrix similar to image data.
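As a minimal sketch of this representation, a document can be turned into a matrix with one row per word and one column per word-vector dimension. The 4-dimensional vectors, the zero vector for unknown words, and the fixed length `max_len` are assumptions for illustration; the patent does not state the dimensions or how short and long documents are handled.

```python
# Toy 4-dimensional word vectors; real vectors would come from a model
# trained on a corpus. All values here are made up for illustration.
EMBEDDINGS = {
    "quarterly": [0.2, 0.1, 0.0, 0.5],
    "revenue":   [0.7, 0.3, 0.1, 0.0],
    "forecast":  [0.4, 0.4, 0.2, 0.1],
}
DIM = 4
UNKNOWN = [0.0] * DIM   # used for out-of-vocabulary words and for padding

def document_matrix(tokens, max_len):
    """Stack one word vector per token into a max_len x DIM matrix."""
    rows = [list(EMBEDDINGS.get(t, UNKNOWN)) for t in tokens[:max_len]]
    rows += [list(UNKNOWN) for _ in range(max_len - len(rows))]
    return rows

matrix = document_matrix(["quarterly", "revenue", "forecast", "draft"], max_len=6)
for row in matrix:
    print(row)
```

Fixing the number of rows gives every document the same shape, which is what lets the matrix be treated like a single-channel image.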
To achieve the above object, the invention provides a sensitive file detection method based on a convolutional neural network, comprising: representing the semantic content of a document as a two-dimensional feature matrix similar to image data, and training word vector features on a data set; combining the trained word vectors to obtain the vector features of the document; and using the vector features as input features for deep learning based on the convolutional neural network, and also treating them as parameters of the method that participate in the iterative training process.
In the above technical solution, preferably, the deep learning based on the convolutional neural network forms a word vector matrix; the two-dimensional data matrix formed from the extracted word vector features is input to the convolutional neural network, which outputs a classification of the file as an ordinary file or a sensitive file.
In the above technical solution, preferably, the actual processing comprises the following. Word vector training: a word vector model is constructed from a corpus, converting documents written in human language into a form a machine can recognize; the more complete the corpus, the more accurate the trained word vectors. Sensitive file feature training: the word vectors are taken as input to the convolutional neural network and, after the operations of a convolutional layer, a pooling layer, a max-pooling layer, and a fully connected layer, file classification and recognition are learned with cross-validation. Detection: the detection module extracts the word vector features of the file to be detected and then identifies sensitive files through the computation of the convolutional neural network.
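The layer sequence named above can be illustrated with a small forward pass in plain Python. This is a sketch under several assumptions: the kernel sizes, the ReLU activation, the sigmoid output, and all weights are invented for illustration, and the pooling and max-pooling layers are collapsed into a single max-pooling step over the word axis. The patent does not disclose the actual network configuration.

```python
import math

def conv1d_over_words(matrix, kernel, bias=0.0):
    """Slide a (window x dim) kernel down the word axis and apply ReLU."""
    window = len(kernel)
    out = []
    for start in range(len(matrix) - window + 1):
        total = bias
        for i in range(window):
            for j, weight in enumerate(kernel[i]):
                total += weight * matrix[start + i][j]
        out.append(max(0.0, total))
    return out

def max_pool(feature_map):
    """Keep the strongest response of one filter across the document."""
    return max(feature_map)

def fully_connected(features, weights, bias):
    """Weighted sum of pooled features squashed to a score in (0, 1)."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# 4 words x 2 dimensions; values and weights are illustrative, not trained.
doc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],    # responds to "dim 0 followed by dim 1"
    [[-1.0, 0.0], [0.0, -1.0]],  # its negation, zeroed by ReLU on this input
]
pooled = [max_pool(conv1d_over_words(doc, k)) for k in kernels]
score = fully_connected(pooled, weights=[1.0, 1.0], bias=-1.0)
label = "sensitive" if score >= 0.5 else "ordinary"
print(pooled, round(score, 3), label)
```

Each kernel spans the full width of the matrix, so it responds to short word sequences rather than to isolated keywords, which is the property the method relies on to avoid the word-bank false alarms.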
Drawings
FIG. 1 is a schematic block diagram of a flow of a sensitive information filtering method based on a convolutional neural network according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The invention is described in further detail below with reference to the accompanying drawing.
A prototype system is constructed on the basis of the sensitive file detection method, using word vectors and a convolutional neural network to detect enterprise business secret files. It is roughly divided into three modules: a word vector training module, a sensitive file training module, and a detection module. The overall architecture of the system is shown in FIG. 1. Specifically, the semantic content of a document is represented as a two-dimensional feature matrix similar to image data, and word vector features are trained on a data set; the trained word vectors are combined to obtain the vector features of the document; and the vector features are used as input features for deep learning based on the convolutional neural network, and are also treated as parameters of the method that participate in the iterative training process.
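One reading of the word vectors "participating in the iterative training process" is that they are updated by gradient descent together with the classifier weights. The sketch below illustrates that idea only: to stay short it swaps the convolutional network for a logistic classifier over the mean word vector, and the names, learning rate, and values are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(embeddings, weights, tokens, label, lr=0.5):
    """One gradient step on log-loss; updates the weights and the word vectors used."""
    dim = len(weights)
    mean = [sum(embeddings[t][j] for t in tokens) / len(tokens) for j in range(dim)]
    pred = sigmoid(sum(w * m for w, m in zip(weights, mean)))
    err = pred - label                   # derivative of log-loss w.r.t. the logit
    old_weights = list(weights)          # gradients use the pre-update weights
    for j in range(dim):
        weights[j] -= lr * err * mean[j]
    for t in set(tokens):
        share = tokens.count(t) / len(tokens)   # this word's share of the mean
        for j in range(dim):
            embeddings[t][j] -= lr * err * old_weights[j] * share
    return pred

# Made-up starting values; label 1 stands for "sensitive".
embeddings = {"budget": [0.1, -0.2], "memo": [0.0, 0.3]}
weights = [0.5, -0.5]
history = [train_step(embeddings, weights, ["budget", "memo"], label=1)
           for _ in range(20)]
print(round(history[0], 3), round(history[-1], 3))
print(embeddings)
```

After the loop both the classifier weights and the stored word vectors have moved, so the vectors act as learned parameters rather than fixed inputs.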
The deep learning based on the convolutional neural network forms a word vector matrix; the two-dimensional data matrix formed from the extracted word vector features is input to the convolutional neural network, which outputs a classification of the file as an ordinary file or a sensitive file.
The actual processing comprises the following. Word vector training: a word vector model is constructed from a corpus, converting documents written in human language into a form a machine can recognize; the more complete the corpus, the more accurate the trained word vectors. Sensitive file feature training: the word vectors are taken as input to the convolutional neural network and, after the operations of a convolutional layer, a pooling layer, a max-pooling layer, and a fully connected layer, file classification and recognition are learned with cross-validation. Detection: the detection module extracts the word vector features of the file to be detected and then identifies sensitive files through the computation of the convolutional neural network.
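The cross-validation step mentioned above can be sketched as a k-fold split over the labeled files. The choice of k-fold, the value k = 3, and the function name are assumptions; the patent does not say which cross-validation scheme is used. In practice the samples would be shuffled before splitting.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train, val in folds:
    print(len(train), val)
```

Each file is used for validation exactly once, and the model would be retrained on the remaining files for each fold.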
Claims (3)
1. A sensitive file detection method based on a convolutional neural network, characterized by comprising the following steps:
representing the semantic content of a document as a two-dimensional feature matrix similar to image data, and training word vector features on a data set;
combining the trained word vectors to obtain the vector features of the document;
and using the vector features as input features for deep learning based on the convolutional neural network, and also treating them as parameters of the method that participate in the iterative training process.
2. The sensitive file detection method based on a convolutional neural network as claimed in claim 1, wherein the deep learning based on the convolutional neural network forms a word vector matrix,
and the two-dimensional data matrix formed from the extracted word vector features is input to the convolutional neural network, which outputs a classification of the file as an ordinary file or a sensitive file.
3. The sensitive file detection method based on a convolutional neural network as claimed in claim 1, wherein the actual processing specifically comprises:
word vector training, in which a word vector model is constructed from a corpus, converting documents written in human language into a form a machine can recognize, wherein the more complete the corpus, the more accurate the trained word vectors;
sensitive file feature training, in which the word vectors are taken as input to the convolutional neural network and, after the operations of a convolutional layer, a pooling layer, a max-pooling layer, and a fully connected layer, file classification and recognition are learned with cross-validation;
and detection, in which the detection module extracts the word vector features of the file to be detected and then identifies sensitive files through the computation of the convolutional neural network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010048855.7A CN111259116A (en) | 2020-01-16 | 2020-01-16 | Sensitive file detection method based on convolutional neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111259116A (en) | 2020-06-09 |
Family
ID=70952175
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010048855.7A Pending CN111259116A (en) | 2020-01-16 | 2020-01-16 | Sensitive file detection method based on convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111259116A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107391483A (en) * | 2017-07-13 | 2017-11-24 | 武汉大学 | A kind of comment on commodity data sensibility classification method based on convolutional neural networks |
CN107835496A (en) * | 2017-11-24 | 2018-03-23 | 北京奇虎科技有限公司 | A kind of recognition methods of refuse messages, device and server |
CN107863147A (en) * | 2017-10-24 | 2018-03-30 | 清华大学 | The method of medical diagnosis based on depth convolutional neural networks |
CN109783614A (en) * | 2019-01-25 | 2019-05-21 | 北京信息科技大学 | A kind of the difference privacy leakage detection method and system of social networks text to be released |
CN109871535A (en) * | 2019-01-16 | 2019-06-11 | 四川大学 | A kind of French name entity recognition method based on deep neural network |
Non-Patent Citations (1)
Title |
---|
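Lin Xuefeng (林学峰): "Sensitive file detection method based on convolutional neural network" (基于卷积神经网络的敏感文件检测方法), Computer and Modernization (《计算机与现代化》) * |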
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhao et al. | idlg: Improved deep leakage from gradients | |
CN112163416B (en) | Event joint extraction method for merging syntactic and entity relation graph convolution network | |
CN107070943B (en) | Industrial internet intrusion detection method based on flow characteristic diagram and perceptual hash | |
CN107992764B (en) | Sensitive webpage identification and detection method and device | |
CN112148997B (en) | Training method and device for multi-modal countermeasure model for disaster event detection | |
CN109918647A (en) | A kind of security fields name entity recognition method and neural network model | |
CN113420294A (en) | Malicious code detection method based on multi-scale convolutional neural network | |
CN105574489A (en) | Layered stack based violent group behavior detection method | |
CN113723330A (en) | Method and system for understanding chart document information | |
CN111047428B (en) | Bank high-risk fraud customer identification method based on small amount of fraud samples | |
CN112328792A (en) | Optimization method for recognizing credit events based on DBSCAN clustering algorithm | |
Ni | Face recognition based on deep learning under the background of big data | |
CN107992508B (en) | Chinese mail signature extraction method and system based on machine learning | |
CN113064967B (en) | Complaint reporting credibility analysis method based on deep migration network | |
CN111259116A (en) | Sensitive file detection method based on convolutional neural network | |
CN117708561A (en) | Information processing method, information processing device, electronic equipment and storage medium | |
CN112966296A (en) | Sensitive information filtering method and system based on rule configuration and machine learning | |
CN118012776A (en) | Software defect prediction method and system based on generation of countermeasure and pretraining model | |
CN114936615B (en) | Small sample log information anomaly detection method based on characterization consistency correction | |
CN112270548B (en) | Credit card fraud detection method based on deep learning | |
CN112116577B (en) | Deep learning-based tamper portrait video detection method and system | |
CN114005004B (en) | Fraud website identification method and system based on picture instance level characteristics | |
CN117216264A (en) | Machine tool equipment fault analysis method and system based on BERT algorithm | |
CN112733144B (en) | Intelligent malicious program detection method based on deep learning technology | |
CN112860648A (en) | Intelligent analysis method based on log platform |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200609 |