CN118535951A

CN118535951A - SQL attack identification method and system based on deep learning dynamic target range feature fusion

Info

Publication number: CN118535951A
Application number: CN202410613633.3A
Authority: CN
Inventors: 罗恩韬; 陈国睿; 李连胜; 张彬; 汪蔚
Original assignee: Hunan University of Science and Engineering
Current assignee: Hunan University of Science and Engineering
Priority date: 2024-05-17
Filing date: 2024-05-17
Publication date: 2024-08-23

Abstract

The present invention discloses a SQL attack identification method and system based on deep learning dynamic target range feature fusion, comprising: S1, building a target server to simulate real SQL injection attacks and obtain injection attack samples; S2, decoding and inspecting the captured injection attack sample packets to obtain a text data set of injection attack statements; S3, combining the term frequency-inverse document frequency (TF-IDF) index with the SuperTerm_Vector word vector algorithm to convert the text of the injection attack statements in the data set into corresponding numerical features; S4, feeding the numerical vector features of the injection attack statements into an LC-CNN model to train a classification model; S5, using the trained LC-CNN model to classify test data and determine whether a SQL attack is present. The present invention achieves significant advantages in training efficiency and classification accuracy, with good overall robustness and generalization ability.

Description

SQL attack identification method and system based on deep learning dynamic target range feature fusion

Technical Field

The present invention relates to a SQL attack identification method based on deep learning dynamic target range feature fusion, and belongs to the intersection of network security and deep learning in artificial intelligence.

Background Art

With the widespread use of Web systems and the great value created by big data, database security is attracting increasing attention. Among the many security risks that databases face, Web applications are particularly exposed to SQL injection attacks at run time. An attacker injects malicious SQL statements to obtain sensitive information from a website; when these statements reach the back end and are executed, they attack the database and give the attacker the privileges of the application's underlying database. The attacker can then freely access the database and the information it holds, for example stealing database contents, personal privacy data and user passwords, and can go on to delete or tamper with data at will, threatening the security of Web system data, user privacy and other sensitive data. According to data published by the national information security vulnerability database, SQL injection vulnerabilities currently account for 11.67% of the Web vulnerabilities reported on the Internet. Detecting and defending against SQL injection attacks is therefore crucial.

The data sets used in SQL injection detection research include public data sets, custom data sets, and mixtures of the two. Traditional SQL injection detection methods rely mainly on techniques such as rule matching and feature extraction, but these methods often struggle to adapt to constantly changing attack techniques. Machine learning can learn from and analyze large amounts of data in order to identify and predict unknown attack techniques, but obtaining data that is both plentiful and reliable is itself a hard problem. In addition, although research on SQL injection attack identification and feature extraction has made important progress in recent years, and the various models each have their own strengths in detecting SQL injection attacks, several key problems remain to be solved, mainly the following:

(1) Data imbalance. In real application scenarios, normal queries far outnumber malicious queries, so the training data is imbalanced and a model tends to learn to classify most queries as normal. In addition, attackers can construct malicious SQL queries in many different ways to bypass detection, which further reduces the model's ability to identify malicious queries.

(2) Attack diversity and variability. Entirely new attack variants are a serious problem for models that rely on known samples: legitimate queries may be wrongly flagged as malicious attacks, giving a high false positive rate, while some malicious queries go undetected, giving a high false negative rate. This further weakens the accuracy and reliability of prediction, so missed detections and false alarms remain high.

Furthermore, publicly available SQL injection attack data sets are currently very limited, mainly because of legal and data security restrictions and the data protection policies of website owners. Website owners generally do not permit unauthorized scanning or attacks, and they often encrypt the transmission of sensitive information. These factors make SQL injection attack data hard to obtain publicly. Even where public data sets exist, they may contain only a few types of SQL injection attack samples and lack the variation and complexity needed to reflect the many forms and techniques of SQL injection; their limited diversity cannot meet the needs of broad research and experimentation.

Summary of the Invention

The technical problem solved by the present invention is the high rate of missed detections and false alarms in existing network SQL attack detection, caused by the limited publicly available SQL injection attack data sets. To address it, the invention provides a SQL attack identification method based on deep learning dynamic target range feature fusion.

The present invention is implemented by the following technical solution:

In one aspect, a SQL attack identification method based on deep learning dynamic target range feature fusion is provided, comprising the following steps:

S1. Build a target server to simulate real SQL injection attacks and obtain injection attack samples;

S2. Decode and inspect the captured injection attack sample packets to obtain a text data set of injection attack statements, and tokenize the injection attack statements in the text data set;

S3. Process the tokenized text data set by combining the TF-IDF index with the SuperTerm_Vector word vector algorithm, converting the text of the injection attack statements in the data set into corresponding numerical vector features;

S4. Combine the numerical vector features of the injection attack statements obtained in S3 with the numerical vector features of normal SQL query statements as the training data set, and feed it into the LC-CNN model to train a classification model, the LC-CNN model having two convolutional layers followed by a flattening layer and then a fully connected layer;

S5. Use the LC-CNN model trained in S4 to classify test data and determine whether a SQL attack is present.

In the SQL attack identification method based on deep learning dynamic target range feature fusion of the present invention, specifically, in step S1 the PHP Study tool is used to set up a simulation environment and a local sqli-libs target range server is built; the SQLMAP interface functions are used to probe and scan the Web application on the local target range server; the SQLMAP automation tool is used to carry out simulated SQL injection attacks automatically; and the Wireshark packet capture tool captures the real, valid injection traffic, yielding the injection attack samples.

In the SQL attack identification method based on deep learning dynamic target range feature fusion of the present invention, specifically, in step S2 the encrypted data in the injection attack sample packets obtained in S1 is identified and decoded: each sample is checked in turn for Base64 encoding and Unicode encoding, decoded accordingly, and output as injection attack statement text in UTF-8. The output text is then simplified: decimal numbers are replaced with 0x12, dates and times are replaced with 1-1-1, repeated keywords are reduced to a single occurrence, and noise characters are deleted from the statement.
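The decoding and simplification rules of step S2 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patent's implementation: the Base64 heuristic, the escape syntaxes treated as "Unicode encoding", the definition of noise characters (inline comments and control characters) and the reading of the keyword rule as collapsing adjacent repeats are all choices made here, and every function name is hypothetical.

```python
import base64
import binascii
import re

def try_base64(text):
    """Return the Base64-decoded text, or None when `text` is not valid Base64."""
    candidate = text.strip()
    if len(candidate) < 8 or len(candidate) % 4 != 0:
        return None
    try:
        return base64.b64decode(candidate, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return None

def decode_unicode_escapes(text):
    """Replace backslash-u and percent-u escapes (4 hex digits) with their characters."""
    return re.sub(r"(?:\\u|%u)([0-9a-fA-F]{4})",
                  lambda m: chr(int(m.group(1), 16)), text)

def decode_payload(raw):
    """Check for Base64 first, then Unicode escapes, and return plain text."""
    decoded = try_base64(raw)
    text = decoded if decoded is not None else raw
    return decode_unicode_escapes(text)

# Assumed definition of noise: inline SQL comments and control characters.
NOISE = re.compile(r"/\*.*?\*/|[\x00-\x1f]")
# Assumed reading of the keyword rule: collapse adjacent repeats such as "UNION UNION".
REPEATED = re.compile(r"\b([A-Za-z_]+)(?:\s+\1\b)+")
# Dates are tried before plain numbers so that a date is replaced as one unit.
DATE_OR_NUMBER = re.compile(
    r"(?P<date>\b\d{4}-\d{1,2}-\d{1,2}(?:[ T]\d{1,2}:\d{2}(?::\d{2})?)?)"
    r"|(?P<num>\b\d+(?:\.\d+)?\b)"
)

def normalize(statement):
    """Apply the simplification rules to one decoded statement."""
    text = NOISE.sub(" ", statement)
    text = REPEATED.sub(r"\1", text)
    text = DATE_OR_NUMBER.sub(
        lambda m: "1-1-1" if m.group("date") else "0x12", text)
    return " ".join(text.split())
```

Dates and numbers are replaced in a single pass so that the digits of the date placeholder are not themselves rewritten as numbers. Real captures would normally also need URL decoding, which the text does not mention and which is left out here.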

In the SQL attack identification method based on deep learning dynamic target range feature fusion of the present invention, specifically, step S2 uses a space segmentation method to tokenize the decoded injection attack statements: each SQL injection attack statement is divided into a sequence of strings, with spaces added before and after each string.
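One way to read the space segmentation method is to pad every special character or operator with spaces and then split on whitespace. The sketch below follows that reading; the set of multi-character operators and the name `tokenize` are assumptions made for illustration.

```python
import re

# Multi-character operators are listed first so they are kept as single tokens;
# any other character that is neither a word character nor whitespace is
# treated as a one-character token.
SPECIAL = re.compile(r"(--|<=|>=|<>|!=|\|\||&&|[^\w\s])")

def tokenize(statement):
    """Pad special characters with spaces, then split on whitespace."""
    padded = SPECIAL.sub(r" \1 ", statement)
    return padded.split()
```

For example, `tokenize("1' OR 1=1--")` yields `['1', "'", 'OR', '1', '=', '1', '--']`, so quotes and comment markers survive as tokens of their own instead of being glued to neighbouring text.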

In the SQL attack identification method based on deep learning dynamic target range feature fusion of the present invention, specifically, in step S3 the TF-IDF index is calculated by the following formula:

TF-IDF = TF × IDF,

where TF-IDF is the term frequency-inverse document frequency index, TF(i) is the frequency with which word i occurs in the text, IDF(i) is the importance index of word i, Total(i) is the number of occurrences of word i in the injection attack statement, Total is the total number of words in the injection attack statement, T(i) is the number of statements containing word i, and φ is an offset.
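The individual expressions for TF(i) and IDF(i) are not reproduced in this text. A minimal sketch is given below, assuming the common textbook forms TF(i) = Total(i) / Total and IDF(i) = log(N / (T(i) + φ)), where N, the number of statements in the data set, is a symbol introduced here and not defined in the source. The patent's own formulas may differ.

```python
import math
from collections import Counter

def tf_idf(statements, phi=1.0):
    """Return one {word: weight} dict per tokenized statement.

    Assumed forms: TF(i) = Total(i) / Total and
    IDF(i) = log(N / (T(i) + phi)), with N the number of statements.
    """
    n_statements = len(statements)
    # T(i): number of statements that contain word i at least once.
    containing = Counter(word for tokens in statements for word in set(tokens))
    weights = []
    for tokens in statements:
        counts = Counter(tokens)   # Total(i) within this statement
        total = len(tokens)        # Total for this statement
        weights.append({
            word: (count / total)
            * math.log(n_statements / (containing[word] + phi))
            for word, count in counts.items()
        })
    return weights
```

Under these assumed forms the offset φ keeps the denominator non-zero for unseen words, and a word that appears in almost every statement receives a weight near zero or below.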

In the SQL attack identification method based on deep learning dynamic target range feature fusion of the present invention, specifically, the SuperTerm_Vector word vector algorithm represents each word in an injection attack statement as a vector, mapping the input text data into a multidimensional space, through the following process:

Build the corpus: collect publicly available text data as the corpus and preprocess it, including tokenization, punctuation removal, conversion to lowercase and stop-word removal, to obtain clean text data;

Build the vocabulary: construct from the preprocessed corpus a vocabulary containing every unique word that appears in the corpus;

Build the training samples: construct training samples from the vocabulary, each consisting of a center word and its surrounding context words;

Define the model structure: select the Skip-gram model, which predicts the surrounding context words from the center word, and feed the training samples into the defined Skip-gram structure for training; during training the model parameters are adjusted by minimizing a negative log-likelihood loss, which maximizes the probability of the predicted context words;

Obtain the word vectors: feed the text data set tokenized in step S2 into the trained Skip-gram model, so that every word in the data set is mapped into the word vector space and represented as a vector of the mapped length. On the principle that words of an injection attack statement that are similar in semantic space are also close in vector space, the distance between word vectors is computed as follows:

An injection attack statement contains n words w₁, w₂, w₃, ..., w_n, and the list of word vectors of all its words in the SuperTerm_Vector algorithm is denoted v₁, v₂, v₃, ..., v_n. The cosine similarity similarity(v_μ, v_ν) of two word vectors v_μ and v_ν is calculated as follows:

similarity(v_μ, v_ν) = (v_μ · v_ν) / (||v_μ|| ||v_ν||),

where · denotes the vector dot product and ||v_μ|| and ||v_ν|| denote the norms of the word vectors v_μ and v_ν. The similarity between different words in an injection attack statement is computed from the cosine similarity, which in turn gives the distance between those words in the word vector space.
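The cosine similarity of two word vectors can be computed directly from its definition. The sketch below uses plain Python lists to stay dependency-free; the zero-vector check is an addition made here, since the similarity is undefined when either norm is zero.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

Vectors pointing in the same direction score close to 1, orthogonal vectors score 0, and opposite vectors score -1, so a higher similarity corresponds to a smaller distance in the word vector space.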

In the SQL attack identification method based on deep learning dynamic target range feature fusion of the present invention, specifically, for each word in an injection attack statement, its word vector is multiplied by its TF-IDF value to obtain the weighted word vector V′_W of that word. The average weighted word vector of the injection attack statement is expressed as (1/n) Σ_{W∈D} V′_W, where n is the number of words in the injection attack statement, W is a word contained in the statement, and D is the injection attack statement. The average weighted word vector of each injection attack statement is used as the numerical vector feature for training the LC-CNN classification model.
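The feature fusion step, in which each word vector is scaled by its TF-IDF weight and the results are averaged over the statement, might look like the sketch below. The dictionaries `word_vectors` and `tfidf` are hypothetical inputs, and treating out-of-vocabulary words as zero vectors is an assumption made here.

```python
def sentence_vector(tokens, word_vectors, tfidf):
    """Average of the TF-IDF-weighted word vectors of one statement."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in tokens:
        vector = word_vectors.get(word)
        if vector is None:
            continue  # assumption: unknown words contribute a zero vector
        weight = tfidf.get(word, 0.0)
        for k in range(dim):
            total[k] += weight * vector[k]
    n = len(tokens)
    return [x / n for x in total] if n else total
```

The result has the same dimension as a single word vector regardless of statement length, which is what allows statements of different lengths to share one classifier input shape.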

In the SQL attack identification method based on deep learning dynamic target range feature fusion of the present invention, specifically, in step S4 the convolutional layers use the ReLU function as the activation function, the fully connected layer uses the Sigmoid function as the activation function, and the flattening layer uses the Tanh function as the activation function.

In the SQL attack identification method based on deep learning dynamic target range feature fusion of the present invention, specifically, a model.compile() call is added to the LC-CNN model to define its optimizer, loss function and evaluation metrics: the optimizer is Adam, the loss function is binary cross-entropy, and the evaluation metrics include accuracy, F1 score and the confusion matrix. Over the repeated training iterations of the classification model, the optimal model parameter settings are those of the iteration with the highest classification accuracy.
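The three evaluation metrics named above can be computed from binary labels as in the following sketch. It assumes label 1 means injection attack and label 0 means normal query, and the function names are illustrative only.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, with 1 = attack, 0 = normal."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def accuracy(y_true, y_pred):
    """Share of all samples that were classified correctly."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because normal queries usually outnumber attacks, accuracy alone can look high even when many attacks are missed, which is one reason to report the F1 score and the confusion matrix alongside it.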

The present invention also discloses a SQL attack identification system based on deep learning dynamic target range feature fusion, comprising:

a data collection module, which builds a target server to simulate real SQL injection attacks and obtains injection attack samples;

a data cleaning module, which decodes and inspects the captured injection attack sample packets to obtain a text data set of injection attack statements, and tokenizes the injection attack statements in the text data set;

a feature extraction module, which processes the tokenized text data set by combining the TF-IDF index with the SuperTerm_Vector word vector algorithm, converting the text of the injection attack statements in the data set into corresponding numerical features;

an LC-CNN classifier with a built-in LC-CNN model, the LC-CNN model having two convolutional layers followed by a flattening layer and then a fully connected layer. The classification model is trained on a training data set that combines the numerical features of the injection attack statements from the feature extraction module with the numerical vector features of normal SQL query statements, and the trained LC-CNN model is then used to classify test data and determine whether a SQL attack is present.

By adopting the above technical solution, the present invention has the following beneficial effects:

(1) The present invention creates its own SQL injection attack range by building a local sqli-libs target range server, and uses the global probing and automatic scanning functions provided by the SQLMAP interface functions to probe and scan the Web application on the local target machine in depth, actively simulating SQL injection attacks, observing the Web application's responses, and collecting attack data and vulnerability information. This keeps the data collection process under control, meets the needs of research and experimentation, and improves the ability to study and defend against SQL injection attacks. Attack scenarios can also be controlled and monitored precisely, which safeguards security and compliance. The approach helps overcome the limits on data acquisition and provides more comprehensive and diverse SQL injection attack samples, while also improving the ability to prevent and detect SQL injection attacks.

(2) The present invention uses the Wireshark tool to capture traffic on the local loopback network interface; the captured packets contain the requests, responses and transmitted data of the attack process. Because the captured packets may contain encrypted or obfuscated information, the encrypted data must be identified and decoded, the presence of SQL injection statements must be detected, and preprocessing such as tokenization must be carried out. These steps produce clear, intelligible data suitable for training and analysis with machine learning models, which improves the accuracy and adaptability of the model.

(3) The present invention uses a space segmentation method to divide each SQL injection statement into a sequence of strings, adding spaces before and after each string to keep it intact. This step helps capture where special characters occur and in what context, makes the structure and syntax of SQL injection attack statements easier to understand, and so helps identify potential attack patterns. It also gives the machine learning model clear, meaningful input data, improving the ability to prevent and detect SQL injection attacks.

(4) The present invention converts text data into structured data with numerical features. The TF-IDF index reflects the information in text data more accurately, and the SuperTerm_Vector word vector algorithm, which is based on the Word2Vec word vector model, improves performance in capturing text semantics and contextual relationships. The present invention therefore combines the TF-IDF index with the SuperTerm_Vector word vector algorithm to convert the text data of SQL injection attacks into numerical features that a classification model can interpret, so that the resulting structured data can be used effectively to train the classification model.

(5) The present invention adopts an improved LC-CNN model. Two convolutional layers are set up first, with a convolution kernel size of 64×1, a depth of 64 and a stride of 1, followed by a flattening layer (flatten layer) and finally a fully connected layer. After the first round of convolution there are 203,392 parameters. A further convolutional layer is added on top of this to extract more features, and after the second round of convolution there are 1,056 parameters. In traditional CNN models a pooling layer is usually used to reduce the number of parameters, but after two convolutions the parameters are not suited to pooling, so the present invention omits the pooling layer and flattens directly in preparation for the subsequent binary classification. This design is intended to strengthen the model's handling of one-dimensional text data and to support the subsequent training tasks.

In summary, to address the scarcity of SQL injection attack data sets and the shortcomings of existing feature extraction, the present invention proposes a feature extraction model that fuses the TF-IDF index with the SuperTerm_Vector word vector algorithm, combined with a method of simulating attack scenarios on a target server. A standardized SQL injection data set is trained with machine learning techniques, and the model parameters of the convolutional neural network are optimized. Compared with traditional SQL injection detection based on Web application source code, the present invention achieves significant advantages in training efficiency and classification accuracy, reaches a detection rate above 90% in identifying SQL attack actions and detecting vulnerabilities of different severity levels, and shows good overall robustness and generalization ability.

The present invention is further described below with reference to the accompanying drawings and specific embodiments.

Brief Description of the Drawings

FIG. 1 is a flowchart of the steps of the SQL attack identification method based on deep learning dynamic target range feature fusion of the present invention.

FIG. 2 is a schematic diagram of the SQL attack identification system based on deep learning dynamic target range feature fusion of the present invention.

FIG. 3 is a schematic diagram of the confusion matrix of the LC-CNN classifier model trained with the present invention in the embodiment.

FIG. 4 is a schematic comparison of the training results obtained in the embodiment with the data set constructed by the present invention and with an existing public data set.

FIGS. 5a and 5b are schematic diagrams of the accuracy and recall curves obtained in the embodiment when the LC-CNN classifier model of the present invention and a logistic regression model are trained separately for comparison.

FIGS. 6a and 6b are schematic diagrams of the accuracy and recall curves obtained in the embodiment when the LC-CNN classifier model of the present invention and a random forest model are trained separately for comparison.

FIGS. 7a and 7b are schematic diagrams of the accuracy and recall curves obtained in the embodiment when the LC-CNN classifier model of the present invention and a Bayesian model are trained separately for comparison.

FIGS. 8a and 8b are schematic diagrams of the accuracy and recall curves obtained in the embodiment when the LC-CNN classifier model of the present invention and a decision tree model are trained separately for comparison.

Detailed Description

Embodiment

Referring to FIG. 1, which shows the step flow of the SQL attack identification method based on deep learning dynamic target range feature fusion of the present invention, the method specifically comprises the following steps:

S1. Build a target server to simulate real SQL injection attacks and obtain injection attack samples.

The PHP Study tool is used to set up a simulation environment and a local sqli-libs target range server is built. The SQLMAP interface functions are used to probe and scan the Web application on the local target range server, the SQLMAP automation tool is used to carry out simulated SQL injection attacks automatically, and the Wireshark packet capture tool captures the real, valid injection traffic, yielding the injection attack samples.
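For a local, self-owned test range, the collection step could be scripted roughly as below. The helpers only build command lines and do not run anything. The use of the sqlmap command line rather than its API, the use of tshark (the command-line companion of Wireshark) for capture, the example URL and the interface name are all assumptions that are not taken from the text.

```python
def sqlmap_command(target_url, level=1, risk=1):
    """Build (but do not run) a sqlmap command line for one test URL."""
    return [
        "sqlmap",
        "-u", target_url,      # URL of the page under test
        "--batch",             # accept default answers, no interactive prompts
        f"--level={level}",    # how many tests to perform
        f"--risk={risk}",      # how intrusive the tests may be
    ]

def tshark_capture_command(interface, pcap_path, port=80):
    """Build (but do not run) a tshark capture command for HTTP traffic."""
    return [
        "tshark",
        "-i", interface,           # capture interface; its name varies by OS
        "-f", f"tcp port {port}",  # capture filter
        "-w", pcap_path,           # write raw packets to this file
    ]

# Hypothetical values for a range hosted on the local machine.
attack = sqlmap_command("http://127.0.0.1/Less-1/?id=1")
capture = tshark_capture_command("lo", "injection_samples.pcap")
```

Starting the capture before the scan and stopping it afterwards leaves a packet file from which the request payloads can be extracted for step S2. Such scans should only ever be pointed at a range that one owns and controls, which is the premise of the method.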

S2. Decode and inspect the captured injection attack sample packets to obtain a text data set of injection attack statements, and tokenize the injection attack statements in the text data set.

The encrypted data in the captured injection attack sample packets is identified and decoded: each sample is checked in turn for Base64 encoding and Unicode encoding, decoded accordingly, and output as injection attack statement text in UTF-8. The output text is then simplified: decimal numbers are replaced with 0x12, dates and times are replaced with 1-1-1, repeated keywords are reduced to a single occurrence, and noise characters are deleted from the statement. The decoded injection attack statements are then tokenized by the space segmentation method: each SQL injection attack statement is divided into a sequence of strings, with spaces added before and after each string.

After the captured injection attack sample packets have been decoded to recover the original request information, the recovered statements are also checked in the following ways to determine whether they are SQL injection attack statements.

(1) Collect existing injection attack samples by searching public vulnerability reports, security forums and known security data sets for known SQL injection attack samples. These samples usually consist of SQL statements with which malicious users attempted to exploit vulnerabilities or carry out injection attacks.

(2) Build a data set. The collected injection attack samples are organized into a data set containing the injection attack statements and their labels, such as a normal query label or an injection attack label.

(3) Based on known SQL injection attack samples or patterns, build an attack list containing known injection attack statement patterns. A safe list containing normal SQL query statement patterns can be built as well. The sample under inspection is then matched against the statement patterns in these lists to determine whether an injection attack is present.

(4) An existing deep learning model can be trained on the SQL injection attack data set, using supervised learning to build a classification model that distinguishes normal SQL queries from injection attack statements.
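The list-matching check in item (3) can be sketched with regular expressions. The patterns below are small illustrative examples chosen here, not lists from the patent, and a statement that matches neither list is reported as unknown so that it can be passed on to a trained classifier as in item (4).

```python
import re

# Illustrative attack list: a few well-known injection shapes.
ATTACK_PATTERNS = [
    re.compile(r"\bunion\b.+\bselect\b", re.IGNORECASE),   # UNION-based
    re.compile(r"\bor\b\s+\d+\s*=\s*\d+", re.IGNORECASE),  # tautology, OR 1=1
    re.compile(r"(--|#)\s*$"),                             # trailing comment
    re.compile(r"\bsleep\s*\(", re.IGNORECASE),            # time-based
]

# Illustrative safe list: a simple single-table SELECT with an optional filter.
SAFE_PATTERNS = [
    re.compile(
        r"^select\s+[\w\s,*]+\s+from\s+\w+"
        r"(\s+where\s+\w+\s*=\s*(\?|\d+|'\w*'))?\s*;?$",
        re.IGNORECASE,
    ),
]

def classify(statement):
    """Return 'attack', 'normal' or 'unknown' for one decoded statement."""
    text = statement.strip()
    if any(p.search(text) for p in ATTACK_PATTERNS):
        return "attack"
    if any(p.search(text) for p in SAFE_PATTERNS):
        return "normal"
    return "unknown"
```

The attack list is checked first so that a statement matching both lists is still flagged. Pattern lists of this kind only recognize shapes that are already known, which is the limitation that motivates the learned classifier.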

S3. Process the tokenized text data set by combining the TF-IDF index with the SuperTerm_Vector word vector algorithm, converting the text of the injection attack statements in the data set into corresponding numerical features.

In step S3, the TF-IDF index is calculated by the following formula:

TF-IDF = TF × IDF.

The SuperTerm_Vector word vector algorithm represents each word in an injection attack statement as a vector: the input text is mapped into a multidimensional space in which each word is a vector of the corresponding mapped length, so that words in injection attack statements that are close in semantic space are also close in vector space.

S4. The numerical vector features of the injection attack statements obtained in S3, together with the numerical vector features of normal SQL query statements, form the training dataset, which is input to the LC-CNN model for classification training. The LC-CNN model has two convolutional layers followed by a flattening layer and then a fully connected layer. The convolutional layer uses a filter size of 64×1, a depth of 64, and a stride of 1, and after this convolution there are 203,392 parameters. A second convolutional layer is added on top of it to extract further features, after which there are 1,056 parameters. In a traditional neural network the pooling layer mainly serves to reduce the number of parameters; in the present invention the output after two convolutions is not suited to pooling, so the pooling layer is omitted and the output is flattened directly in preparation for the subsequent binary decision. This design enables the model to better process one-dimensional text data and supports the subsequent training.

In step S4, the convolutional layers use the ReLU function as the activation function, the fully connected layer uses the Sigmoid function, and the flattening layer uses the Tanh function. On this basis, the model.compile() function is called on the LC-CNN model to define its optimizer, loss function, and evaluation metrics: the optimizer is Adam, the loss function is binary cross-entropy, and the evaluation metrics include accuracy, the F1 value, and the confusion matrix. Over multiple training iterations, the optimal model parameters are taken from the iteration with the highest classification accuracy.

Training the LC-CNN classification model also requires collecting normal SQL query statements and adding them to the training dataset. The dataset should be sufficiently representative and diverse to improve the generalization ability of the model. The normal SQL query statements undergo the preprocessing of steps S2 and S3 so that they are converted into a format suitable for neural network input. The LC-CNN model is then trained on the dataset containing both normal SQL query statements and SQL injection attack statements, and during training it learns to distinguish the two.

Referring to Figure 2, this embodiment further provides a system that uses the above SQL attack identification method, comprising a data collection module 100, a data cleaning module 200, a feature extraction module 300, and an LC-CNN classifier 400.

The data collection module 100 is used to build a target server that simulates real SQL injection attacks and to obtain injection attack samples.

The data collection module 100 uses sqli-libs as the target range and builds the target server locally, combining automation tools with realistic attack scenarios. When the SQL injection target range is built, the PHP Study tool is first used to set up a simulation environment that reproduces common vulnerabilities and weaknesses of applications in real networks, so that SQL injection attacks can be carried out in a controlled environment and the attack data collected. With the automatic scanning and detection functions of SQLMAP enabled, continuous SQL injection attacks are simulated to obtain diverse and realistic injection attack samples.

SQL injection attacks usually insert special characters after a system form field or URL query string to construct an illegal SQL statement, which is propagated as an input parameter from the source node to the sink node and then passed to the Web application. When the target server executes the illegal SQL statement, the attacker attacks vulnerabilities in the XML documents stored in the relational database and thereby carries out the intended malicious operation. To simulate this execution flow, the data collection module 100 of the present invention uses the SQLMAP automation tool to execute SQL injection attacks automatically and the Wireshark packet capture tool to capture real and valid injection data. Wireshark captures data samples from the local network interface, and the response information is used to judge whether the target website has an SQL injection vulnerability: the captured samples are searched for HTTP requests to the target website whose injected statement successfully returned database information. A successful return shows that the injection succeeded and that an injection vulnerability exists. This improves the comprehensiveness and authenticity of the data and lays a foundation for further analysis.

The data cleaning module 200 decodes and inspects the acquired injection attack sample data packets to obtain a text dataset of injection attack statements, and performs word segmentation on the statements in the dataset.

The data collection module 100 saves each new SQL injection attack sample it finds until the SQLMAP automatic scan ends. Because captured data packets may contain encrypted or obfuscated information, the data cleaning module 200 must identify and decode the encrypted data in the samples, detect the presence of SQL injection statements, and perform data cleaning such as word segmentation. During cleaning, modified injection statements are decrypted or restored to recover the original SQL injection content. The data collection module 100 first applies the SQL dynamic target range data feature decoding algorithm: it checks whether the input injection attack text is in Base64 format and, if so, Base64-decodes and outputs it; if not, it checks whether the text is in Unicode-encoded format and, if so, Unicode-decodes it; otherwise it returns 0 to indicate that the text cannot be decoded. The output injection attack statement text is then simplified: hexadecimal numbers are converted to 0x12, all dates and times are replaced with "1-1-1", keywords that have been redundantly rewritten are reduced to a single occurrence, and meaningless noise characters in the SQL statement are deleted. Word segmentation is then performed.
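The decoding cascade and two of the simplification rules above can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the function names `decode_payload` and `simplify`, the Base64 heuristic (alphabet check, length a multiple of 4, result must be valid UTF-8), the treatment of "Unicode format" as backslash-u escape sequences, and the date pattern are all assumptions. Keyword de-duplication and noise removal are not shown.

```python
import base64
import binascii
import re


def decode_payload(text):
    """Try Base64 first, then backslash-u escapes; return 0 if neither applies."""
    candidate = text.strip()
    # Assumed Base64 test: valid alphabet, length divisible by 4, decodes to UTF-8.
    if (candidate and len(candidate) % 4 == 0
            and re.fullmatch(r"[A-Za-z0-9+/]+={0,2}", candidate)):
        try:
            return base64.b64decode(candidate, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            pass  # looked like Base64 but is not decodable text
    # Assumed Unicode test: the text contains escapes of the form \uXXXX.
    if re.search(r"\\u[0-9a-fA-F]{4}", candidate):
        return re.sub(r"\\u([0-9a-fA-F]{4})",
                      lambda m: chr(int(m.group(1), 16)), candidate)
    return 0  # cannot be decoded


def simplify(statement):
    """Normalise hexadecimal literals and dates/times, then collapse whitespace."""
    s = re.sub(r"0x[0-9a-fA-F]+", "0x12", statement)
    s = re.sub(r"\d{4}-\d{1,2}-\d{1,2}(?: \d{1,2}:\d{2}(?::\d{2})?)?", "1-1-1", s)
    return re.sub(r"\s+", " ", s).strip()
```

Short plain-text inputs can pass the Base64 alphabet check by accident, which is why the sketch also requires the decoded bytes to be valid UTF-8 before accepting them.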

The precision of word segmentation and the handling of special characters are crucial to subsequent model training and detection. Because special characters play a key role in SQL injection attack statements, they must be recognized as independent tokens so that their original form is preserved during segmentation. The data cleaning module 200 therefore splits each SQL injection statement into a sequence of strings on spaces, adding a space before and after each special character to keep it intact. This step helps capture the position and context of special characters and makes the structure and syntax of injection attack statements easier to analyze, which helps identify potential attack patterns. It also gives the machine learning model clear, meaningful input, improving the prevention and detection of SQL injection attacks.
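A minimal sketch of this space-based segmentation is shown below. It assumes that a "special character" is any character that is not a letter, digit, underscore, or whitespace; the patent does not define the set, and the function name `tokenize` is illustrative.

```python
import re


def tokenize(statement):
    """Split a statement on spaces, keeping each special character as its own token."""
    # Surround every special character with spaces so that split() isolates it.
    padded = re.sub(r"([^\w\s])", r" \1 ", statement)
    return padded.split()
```

For example, `tokenize("select*from t where a='x'")` yields the tokens `select`, `*`, `from`, `t`, `where`, `a`, `=`, `'`, `x`, `'`, so the quote and operator characters survive as separate tokens.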

The feature extraction module 300 processes the word-segmented text dataset by combining the word frequency text frequency index (TF-IDF) with the SuperTerm_Vector word vector algorithm, converting the text of the injection attack statements in the dataset into corresponding numerical features.

Once the injection attack data has been cleaned, the next step is to convert it into feature vectors, that is, numerical features that a machine learning model can process, to facilitate model training. The converted features include the keywords in the SQL injection attack statement, the frequency of special characters, the statement length, and so on. After feature extraction, these feature vectors can be used to build a machine learning model that identifies SQL injection attacks. The feature extraction module 300 is based on the bag-of-words model and combines TF-IDF with the SuperTerm_Vector word vector algorithm to process the cleaned injection attack data, so that the resulting structured data can be used effectively to train the classification model.

Specifically, the word frequency text frequency index algorithm (TF-IDF for short) is a keyword extraction algorithm, calculated by the following formula:

$$\text{TF-IDF}(i) = \text{TF}(i) \times \text{IDF}(i), \qquad \text{TF}(i) = \frac{\text{Total}(i)}{\text{Total}}$$

In the formula, TF-IDF denotes the word frequency text frequency index: the frequency of a word in the text is multiplied by the importance of that word to give a weight, and a larger weight indicates a more important word. TF(i) is the frequency of word i in the text, calculated by dividing the number of occurrences of the word by the total number of words in the text. IDF(i) is the importance index of word i: a frequently occurring word is not necessarily a keyword and may simply be a common word, so IDF(i) reduces the weight of common words and thereby raises the weight of keywords. Total(i) is the number of occurrences of word i in the injection attack statement, Total is the total number of words in the injection attack statement, and T(i) is the number of statements containing word i. An offset is included in IDF(i) to prevent a calculation error in the log function when the count in the total document library is zero.
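The weighting described above can be sketched as follows. The TF part follows the text directly (Total(i) divided by Total). The exact IDF expression is not reproduced in the text, so the sketch assumes the common form log(N / (T(i) + offset)), where N is the number of statements in the corpus and the offset defaults to 1; both N and the default value are assumptions, as is the function name `tf_idf`.

```python
import math


def tf_idf(sentence, corpus, offset=1.0):
    """TF-IDF weight of each distinct token in a non-empty token list.

    sentence: list of tokens for one statement.
    corpus:   list of token lists, one per statement.
    """
    total = len(sentence)                       # Total
    weights = {}
    for word in set(sentence):
        tf = sentence.count(word) / total       # Total(i) / Total
        containing = sum(1 for doc in corpus if word in doc)  # T(i)
        # Assumed IDF form; the offset keeps the denominator non-zero.
        idf = math.log(len(corpus) / (containing + offset))
        weights[word] = tf * idf
    return weights
```

With this form, a token that appears in most statements receives a weight near zero, while a token concentrated in a few statements is weighted more heavily.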

However, applying TF-IDF directly to feature modeling of SQL injection attacks has two drawbacks. First, the algorithm focuses heavily on keyword weights and ignores a keyword's position and its relationship to other words. Second, TF-IDF is built on the bag-of-words model, so the vectorized text can have thousands of dimensions, which makes subsequent model training excessively expensive and imprecise. The feature extraction module 300 of the present invention therefore also introduces the SuperTerm_Vector word vector algorithm for further dimensionality reduction and optimization, improving the accuracy and efficiency of keyword extraction.

The SuperTerm_Vector algorithm represents each word in an SQL injection attack statement as a vector. The input text is mapped into a multidimensional space in which each word is a vector of a certain length, and words that are close in semantic space are also close in vector space. The word vectors produced by the SuperTerm_Vector algorithm carry both the contextual semantic relationships and the position information of the word. This resolves the problem of excessive dimensionality in the model.

The SuperTerm_Vector algorithm obtains the word vector of each word in an injection attack statement as follows:

Build a corpus: collect publicly available text data as the corpus, which may be text of any form, such as articles, news, blogs, or Wikipedia. Preprocess the text in the corpus, including word segmentation, punctuation removal, conversion to lowercase, and stop-word removal, to obtain clean text data;

Construct training samples: build the training samples from the vocabulary. Each training sample consists of a center word and its surrounding context words, where the context words are the words within a fixed window before and after the center word;

Define the model structure: select the Skip-gram model, which predicts the surrounding context words from the center word, and train it on the training samples. During training, the model parameters are adjusted by minimizing the negative log-likelihood loss, which maximizes the probability of the predicted context words;

Obtain word vectors: input the word-segmented text dataset from step S2 into the trained Skip-gram model. Every word in the dataset is mapped into the word vector space and represented as a vector of the corresponding mapped length, so that words in injection attack statements that are similar in semantic space are also close in vector space.
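The training-sample construction in the steps above can be sketched as follows: each center word is paired with every word inside a fixed window around it. The function name `skipgram_pairs` and the default window of 2 are illustrative assumptions; the patent does not state a window size.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for Skip-gram training."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # left edge of the window
        hi = min(len(tokens), i + window + 1)   # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                          # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs
```

The resulting pairs are what a Skip-gram model consumes: given the first element, it is trained to assign high probability to the second.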

The distance between word vectors in this space is calculated as follows:

The corpus contains every unique word that occurs. Each word w_i is represented by a pre-trained word vector, denoted v_i, so the mapping can be written w_i → v_i.

An injection attack statement S contains n words w_1, w_2, w_3, ..., w_n and is represented in the SuperTerm_Vector algorithm as the list of the word vectors of all its words, S → [v_1, v_2, v_3, ..., v_n]. The cosine similarity similarity(v_μ, v_ν) of two word vectors v_μ and v_ν is calculated as follows:

$$\text{similarity}(v_\mu, v_\nu) = \frac{v_\mu \cdot v_\nu}{\lVert v_\mu \rVert \, \lVert v_\nu \rVert}$$

where · denotes the dot product of the vectors, and ||v_μ|| and ||v_ν|| denote the norms of the word vectors v_μ and v_ν respectively. Cosine similarity gives the similarity between different words in an injection attack statement as well as the similarity between statements: if two injection attack statements are highly similar in semantic space, their representations in vector space are also close.
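Cosine similarity as defined above (dot product divided by the product of the norms) can be sketched in a few lines. Returning 0.0 for a zero vector is a convention chosen for this sketch, not something the patent specifies.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention: similarity with a zero vector is 0
    return dot / (norm_u * norm_v)
```

Vectors pointing in the same direction score close to 1, orthogonal vectors score 0, and opposite vectors score close to -1.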

For each word W in an injection attack statement, find its corresponding word vector V_W, and multiply the cosine similarity that expresses the word's vector distance in the vector space by its TF-IDF score to obtain the weighted word vector V′_W of the word. The average weighted word vector of the injection attack statement is expressed as

$$V_D = \frac{1}{n} \sum_{W \in D} V'_W$$

where n is the number of words in the injection attack statement, W ranges over the words it contains, and D denotes the injection attack statement.

Through this process, a weighted word vector representation is obtained for each injection attack statement document, in which the weight of each word vector is determined by its TF-IDF score. This representation takes into account both the frequency and the semantic information of the words.
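One way to read this fusion step is sketched below: each word vector is scaled by the word's TF-IDF score and the scaled vectors are averaged over the n words of the statement. The text also mentions a cosine-similarity factor whose exact role is not spelled out, so it is left out here; the function name `sentence_vector` and the dictionary inputs are assumptions made for illustration.

```python
def sentence_vector(tokens, word_vectors, tfidf):
    """Average of TF-IDF-weighted word vectors for one statement.

    tokens:       list of words in the statement (length n, non-empty).
    word_vectors: dict mapping word -> list of floats (all the same length).
    tfidf:        dict mapping word -> TF-IDF score.
    """
    dim = len(word_vectors[tokens[0]])
    total = [0.0] * dim
    for word in tokens:
        weight = tfidf[word]
        for k, component in enumerate(word_vectors[word]):
            total[k] += weight * component      # accumulate weighted vector
    n = len(tokens)
    return [x / n for x in total]               # divide by the word count n
```

The output has the same dimensionality as a single word vector regardless of statement length, which is what keeps the classifier input compact.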

Although TF-IDF and the SuperTerm_Vector algorithm address keyword frequency weighting and the difficulty of training on an excessive number of dimensions after feature extraction, a traditional convolutional neural network (CNN) still struggles to train well when faced with the syntactic structure of malicious code or complex nonlinear relationships. The present invention therefore builds on the CNN to analyze the complex nonlinear relationships in SQL injection attacks and to learn the features of the input data automatically, so as to identify the syntactic structure of malicious code and specific attack patterns efficiently. The CNN model is also used for fast inference and detection on the input data. On this basis the LC-CNN classifier model of the present invention is constructed, greatly improving the efficiency and precision of real-time SQL injection attack detection.

The LC-CNN classifier model has two convolutional layers followed by a flattening layer and then a fully connected layer. The classification model is trained on the numerical features of injection attack statements supplied by the feature extraction module, and the trained LC-CNN model is then used to identify test data and determine whether an SQL attack is present.

In the LC-CNN classifier model, two convolutional layers are set up first, with a kernel size of 64×1, a depth of 64, and a stride of 1, followed by a flatten layer and finally a fully connected layer. After the first convolution there are 203,392 parameters. A further convolutional layer is added on this basis to extract more features, and after the second convolution there are 1,056 parameters. Pooling layers are normally used to reduce the number of parameters, but after two convolutions the output is not suited to pooling, so the pooling layer is omitted and the output is flattened directly for the subsequent binary classification. This design is intended to strengthen the model's handling of one-dimensional text data and to support subsequent training.

The LC-CNN classifier model is built for a binary classification task, that is, deciding the class of each input. ReLU is chosen as the activation function of the convolutional layers for its advantages in computation speed, training stability, and convergence speed. In the fully connected layer, the Sigmoid function is chosen so that the output can be interpreted as a class probability, and the flattening layer uses the Tanh function. This combination of functions lets the model complete the binary classification task more effectively and contributes to its performance and efficiency.

The Sigmoid function limits the output to the range between 0 and 1 and is calculated as follows:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The Tanh function compresses the output to the range between -1 and 1 and is calculated as follows:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

The ReLU function sets all negative input values to zero and leaves positive values unchanged, as shown in formula (6):

$$\text{ReLU}(x) = \max(0, x) \qquad (6)$$

On this basis, the model.compile() function is called on the LC-CNN classifier model to define its optimizer, loss function, and evaluation metrics. The optimizer is the Adam optimizer and the loss function is binary cross-entropy (Binary Crossentropy). With these settings the classification accuracy of the model on the training set and the test set is reported during training, which allows the training process and its results to be monitored and evaluated. Training of the LC-CNN classifier model runs for multiple iterations, and the model parameters of the iteration with the highest accuracy are selected as the parameters of the classifier.
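To make the loss and the model-selection rule concrete, the sketch below computes binary cross-entropy by hand and picks the iteration with the highest accuracy. It is a plain-Python illustration of the quantities involved, not the framework's internal implementation; the function names and the clipping constant `eps` are assumptions.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # clip so that log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def select_best_epoch(accuracies):
    """Index of the iteration with the highest accuracy (first one on ties)."""
    return max(range(len(accuracies)), key=lambda i: accuracies[i])
```

A prediction of 0.5 for every sample gives a loss of ln 2, and the loss falls as the predicted probabilities move toward the true labels.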

In the present invention, accuracy and the F1 value measure the overall performance of the LC-CNN classifier model, while the confusion matrix provides detailed classification results that help analyze the model's performance on each class. These comparisons show whether feature extraction and attack identification are effective in the LC-CNN classifier model and whether the dataset was constructed successfully. The LC-CNN classifier model therefore uses accuracy, the F1 value, and the confusion matrix as evaluation metrics, in order to understand the model's performance in the attack detection task more precisely, verify the quality of the dataset, and evaluate the model.

Precision is the ratio of the number of positive examples correctly predicted by the LC-CNN classifier model to the number of all samples predicted as positive. High precision means that the classifier rarely misclassifies negative examples as positive, giving more reliable predictions. On some imbalanced datasets, however, high precision is not always the most appropriate metric, because it may overlook the classifier's performance on the minority class. Precision therefore needs to be considered together with recall, the F1 value, and other metrics to evaluate the classifier comprehensively. Precision is calculated as follows:

$$\text{Precision} = \frac{TP}{TP + FP}$$

True positives (TP) are the samples that the LC-CNN classifier correctly predicts as positive, that is, the sample is positive and the prediction is positive. False positives (FP) are the negative samples that the classifier incorrectly predicts as positive, that is, the sample is negative and the prediction is positive. True negatives (TN) are the samples that the classifier correctly predicts as negative, that is, the sample is negative and the prediction is negative. False negatives (FN) are the positive samples that the classifier incorrectly predicts as negative, that is, the sample is positive and the prediction is negative.

Recall is the ratio of the number of positive examples correctly predicted by the LC-CNN classifier to the total number of actual positive examples. It measures how completely the classifier covers the positive examples. There is a trade-off between recall and precision. Recall ranges from 0 to 1, and the closer it is to 1, the better the model detects positive examples. Recall is calculated as follows:

$$\text{Recall} = \frac{TP}{TP + FN}$$

The F1 value is a metric for evaluating the LC-CNN classifier comprehensively. It takes both the recall and the precision of the classifier into account, combining the effects of misclassification and missed detection, and balances the model's performance on positive and negative examples. The F1 value ranges from 0 to 1, and the closer it is to 1, the better the balance the model achieves between recall and precision. It is calculated as follows:

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

The confusion matrix is a table used to assess the classification results of the LC-CNN classifier. The accuracy P derived from the confusion matrix is calculated as follows:

$$P = \frac{TP + TN}{TP + TN + FP + FN}$$
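The four metrics discussed above all follow from the four confusion-matrix counts, as the sketch below shows. The function name `classification_metrics` and the choice to return 0.0 when a denominator is zero are assumptions of this sketch, and the counts used in the usage note are made-up numbers, not results from the patent.

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}
```

With hypothetical counts TP=90, FP=10, TN=80, FN=20, precision is 0.9, recall is 90/110, and accuracy is 0.85, which illustrates how the same predictions can look different depending on the metric.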

The following specific example compares training with the SQL attack identification method of the present invention against conventional machine learning models.

The experimental environment of the SQL attack identification system of the present invention is an Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz running Windows 10, with Python 3.7.3, Anaconda3, Jupyter 6.5.2, and TensorFlow 2.12. Of the 27,000 records prepared for the experiment, 80% are randomly assigned as training data and 20% as test data.
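A random 80/20 split of this kind can be sketched as follows. The function name `split_dataset` and the fixed seed are assumptions added for reproducibility; the patent does not say how the random assignment was implemented.

```python
import random


def split_dataset(records, train_fraction=0.8, seed=0):
    """Shuffle the records and split them into training and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Applied to 27,000 records, this gives 21,600 training records and 5,400 test records.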

In the training results of the LC-CNN classifier model of the present invention, conv1D is a one-dimensional convolutional layer; each such layer has 32 filters and a sliding window with a stride of 1. The first layer has 203,392 trainable parameters, and after the second convolution there are 1,056 parameters. The flatten layer then flattens the output into a one-dimensional vector for input to the fully connected layer, where binary classification is performed. Training runs for eight iterations. The training loss is 5.60% and the accuracy is 98.16%. The validation loss is 9%; this metric evaluates the generalization ability of the model, that is, its performance on unseen data. The validation accuracy is 97% and is interpreted in the same way as the validation loss.
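The reported parameter counts can be checked against the usual formula for a one-dimensional convolution with bias: kernel width × input channels × filters + filters. The sketch below shows shapes that reproduce both numbers with 32 filters. The specific kernel widths and input channel counts are hypothetical, chosen only because their product fits; the patent does not state them.

```python
def conv1d_param_count(kernel_size, in_channels, filters):
    """Trainable parameters of a 1-D convolution layer with one bias per filter."""
    return kernel_size * in_channels * filters + filters


# Second layer: 32 filters over 32 input channels with kernel width 1 (assumed).
second_layer = conv1d_param_count(kernel_size=1, in_channels=32, filters=32)

# First layer: 32 filters where kernel width x input channels = 6355 (assumed).
first_layer = conv1d_param_count(kernel_size=1, in_channels=6355, filters=32)
```

Under these assumptions `second_layer` is 1,056 and `first_layer` is 203,392, matching the figures in the text.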

A confusion matrix was constructed from the test data and the predictions of the LC-CNN classifier model; the result is shown in Figure 3. The comparison shows that samples whose true value is positive are predicted positive and samples whose true value is negative are predicted negative, indicating that the model predicts the data accurately and that the chosen model parameters meet expectations.

Further, this embodiment compares the data set of the LC-CNN classifier with a public data set on GitHub (https://github.com/sql-injection/test-dataset) and the SQL injection data set on Kaggle (https://www.kaggle.com/datasets/kholoodsalah/sql-injection-dataset) in a controlled experiment, in order to explore how model performance differs across data sets. After preprocessing, the GitHub data set has 4,521 dimensions and the Kaggle data set has 6,674 dimensions. The training results for the different data sets are compared in Figure 4. The model trained on the data set constructed by the present invention performs about the same as the model trained on the Kaggle data set, while the GitHub data set gives slightly better results than the other two, mainly because its smaller data volume makes it easier for the model to learn. Overall, the LC-CNN classifier model shows satisfactory training performance in the comparison with the public data sets, which suggests that it may be useful in practical SQL injection detection tasks.

Next, the data set constructed by the present invention is used to train other existing machine learning models for comparison. The existing machine learning models selected for comparison in this embodiment are the logistic regression model, the random forest model, the Bayesian model, and the decision tree model.

The accuracy results of the model comparison training are shown in Table 1 below.

| Iteration | Logistic Regression | Random Forest | Bayesian | Decision Tree | LC-CNN |
|---|---|---|---|---|---|
| 1 | 0.90591631 | 0.86176046 | 0.8745465 | 0.80880231 | 0.9299 |
| 2 | 0.90649351 | 0.88946609 | 0.87281497 | 0.81168831 | 0.9762 |
| 3 | 0.90822511 | 0.90305828 | 0.87418065 | 0.81240981 | 0.9795 |
| 4 | 0.92929293 | 0.90591631 | 0.87401575 | 0.81782107 | 0.9803 |
| 5 | 0.9304674 | 0.91370851 | 0.87335614 | 0.81926407 | 0.9815 |
| 6 | 0.9304674 | 0.92323232 | 0.87405697 | 0.82077173 | 0.9812 |
| 7 | 0.96017316 | 0.93795094 | 0.87269654 | 0.82575758 | 0.9816 |
| 8 | 0.94141414 | 0.92844778 | 0.8734386 | 0.82431457 | 0.9816 |

Table 1. Comparison of accuracy results in training

It can be seen from the table that the LC-CNN classifier model has the highest accuracy and the decision tree model has the lowest.
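The average accuracies quoted in the individual comparisons can be reproduced from the per-iteration values of Table 1. The following sketch simply averages each column; the dictionary layout and variable names are choices made for illustration.

```python
# Per-iteration accuracy values copied from Table 1.
accuracy = {
    "Logistic Regression": [0.90591631, 0.90649351, 0.90822511, 0.92929293,
                            0.9304674, 0.9304674, 0.96017316, 0.94141414],
    "Random Forest": [0.86176046, 0.88946609, 0.90305828, 0.90591631,
                      0.91370851, 0.92323232, 0.93795094, 0.92844778],
    "Bayesian": [0.8745465, 0.87281497, 0.87418065, 0.87401575,
                 0.87335614, 0.87405697, 0.87269654, 0.8734386],
    "Decision Tree": [0.80880231, 0.81168831, 0.81240981, 0.81782107,
                      0.81926407, 0.82077173, 0.82575758, 0.82431457],
    "LC-CNN": [0.9299, 0.9762, 0.9795, 0.9803, 0.9815, 0.9812, 0.9816, 0.9816],
}

# Mean accuracy of each model over the eight iterations.
mean_accuracy = {name: sum(values) / len(values) for name, values in accuracy.items()}
for name, value in mean_accuracy.items():
    print(f"{name}: {value:.3f}")
```

The averages for logistic regression, random forest, the Bayesian model, and the decision tree round to 92.7%, 90.8%, 87.4%, and 81.8% respectively, which matches the figures given in the text; the same calculation gives about 97.4% for the LC-CNN classifier model.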

The recall results of the model comparison training are shown in Table 2 below.

Table 2. Comparison of recall results in training

It can be seen that the recall of the LC-CNN classifier model and of the Bayesian model is very stable, while the recall of the decision tree model is very unstable.

The LC-CNN classifier model of the present invention is compared individually with the logistic regression model. As shown in Figure 5a, the accuracy of the LC-CNN classifier model rises to about 98% after the second iteration, well above that of the logistic regression model. The cross-validation accuracy of logistic regression peaks at only 96%, with an average of 92.7%. Although the gap is clear, logistic regression ranks second in accuracy among all the algorithms compared, behind only the LC-CNN classifier model, which shows that the training results of the LC-CNN classifier model of the present invention are very good. As shown in Figure 5b, the recall of the LC-CNN classifier model rises quickly to above 94% after the second iteration, indicating that its coverage of positive samples reaches a high level. The cross-validation recall of logistic regression is poor: most values are around 80%, the lowest is 75.4%, and its average recall is the lowest of all the models, which means that roughly one-fifth of SQL injection statements go undetected. Therefore, considering accuracy and recall together, the LC-CNN classifier model performs better.

The LC-CNN classifier model of the present invention is compared individually with the random forest model. As shown in Figure 6a, the advantage of the LC-CNN classifier model is even more apparent: the highest accuracy reached by the random forest model is only about 1% above the accuracy of the LC-CNN classifier model in its first iteration. The random forest algorithm handles classification problems quite well, but with the relatively large data of this experiment the large number of features degrades its performance, and its average accuracy is 90.8%. The accuracy of the LC-CNN classifier model is better by comparison. As shown in Figure 6b, the recall of the random forest model fluctuates strongly, most likely because the number of leaf nodes is small relative to the number of features. Adding nodes would deepen the trees and multiply the training time. Under these conditions, the average recall of the random forest model is 84.2%, higher only than that of the logistic regression model. Considering accuracy and recall together, the LC-CNN classifier model of the present invention performs better.

The LC-CNN classifier model of the present invention is compared individually with the Bayesian model. As Figure 7a shows, the accuracy of the Bayesian model is stable at about 87.4%, which indicates that the model performs consistently on the given data set and generalizes well to data it was not trained on; that is, it maintains its performance level across different sample sets without large errors or instability. However, its accuracy is only slightly higher than that of the decision tree model, the lowest of the five algorithms, and is clearly below that of the LC-CNN classifier model of the present invention. As shown in Figure 7b, the Bayesian algorithm uses all features and makes full use of the assumption of independence between features, so it can classify SQL injection attack samples comprehensively and therefore performs well in recall. Its only weakness is that its accuracy is not high. Considering accuracy and recall together, the LC-CNN classifier model of the present invention performs better.

The LC-CNN classifier model of the present invention is compared individually with the decision tree model. Table 1 and Figure 8a show that the decision tree model has the lowest accuracy of all the models compared, with an average of only 81.8%. Random forest is an ensemble learning algorithm that combines multiple decision trees and thereby reduces the overfitting risk of any single tree; a single decision tree lacks this protection and may not generalize well to new data, which leads to its poor performance on the test data. As Figure 8b shows, the recall of the decision tree model is not low in most iterations; only in the seventh iteration is it very poor, at 66%. Considering accuracy and recall together, the LC-CNN classifier model of the present invention is more stable.

A comprehensive comparison of the training of the LC-CNN classifier model of the present invention with the four comparison machine learning models shows that only its precision is slightly lower than that of the Bayesian model. This is mainly because the Bayesian model is somewhat robust to class imbalance in the data, whereas the LC-CNN classifier model of the present invention is more sensitive to imbalanced data. In all other respects, the training results of the LC-CNN classifier model of the present invention are better than those of the comparison machine learning models.

In this document, the orientations or positional relationships indicated by terms such as "up", "down", "front", "back", "left", "right", "top", "bottom", "inside", "outside", "vertical", and "horizontal" are based on the orientations or positional relationships shown in the accompanying drawings. They are used only to express the technical solutions clearly and to simplify the description, and therefore should not be understood as limiting the present invention.

In this document, the terms "comprises", "includes", and any other variants thereof are intended to cover a non-exclusive inclusion: in addition to the elements listed, other elements not expressly listed may also be included.

The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any changes or substitutions that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be determined by the protection scope of the claims.

Claims

1. A SQL attack identification method based on deep learning dynamic target range feature fusion, characterized by comprising the following steps:

S1. Building a target range server to simulate real SQL injection attacks and obtain injection attack samples;

S2. Decoding and inspecting the acquired injection attack sample data packets to obtain a text data set of injection attack statements, and performing word segmentation on the injection attack statements in the text data set;

S3. Combining the term frequency-inverse document frequency (TF-IDF) index with the SuperTerm_Vector word vector algorithm to process the segmented text data set, and converting the text data of the injection attack statements in the text data set into corresponding numerical vector features;

S4. Combining the numerical vector features of the injection attack statements obtained in S3 with the numerical vector features of normal SQL query statements as a training data set, and inputting the training data set into the LC-CNN model for classification model training, wherein the LC-CNN model is provided with two convolutional layers, the convolutional layers are connected to a flattening layer, and the flattening layer is connected to a fully connected layer;

S5. Using the LC-CNN model trained in S4 to identify the test data and determine whether there is an SQL attack.

2. The SQL attack identification method according to claim 1, characterized in that: in step S1, a simulation environment is established using the PHPStudy tool and a local sqli-libs target range server is built; the SQLMAP interface functions are used to probe and scan the Web application on the local target range server; the SQLMAP automation tool is used to simulate and automatically execute SQL injection attacks; and the genuine and effective injection data is captured with the Wireshark packet capture tool to obtain the injection attack samples.

3. The SQL attack identification method according to claim 1, characterized in that: in step S2, the encoded data in the injection attack sample data packets obtained in S1 is identified and decoded; the injection attack sample data is checked in turn for the Base64 format and the Unicode format, decoded, and converted into UTF-8 encoding to output the injection attack statement text; and the output injection attack statement text is simplified, wherein decimal numbers are replaced with 0x12, dates and times are replaced with 1-1-1, double-written keywords are reduced to a single occurrence, and noise characters in the injection attack statement are deleted.
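The decoding and simplification steps of this claim can be sketched as follows. This is only an illustration under stated assumptions: the function names, the heuristic used to recognize Base64, the supported escape forms (`\uXXXX` and `%uXXXX`), and the date pattern (`YYYY-MM-DD`) are all choices made for the example and are not taken from the source. The handling of double-written keywords and noise characters is omitted.

```python
import base64
import binascii
import re


def try_base64(text):
    """Return the Base64-decoded text, or the input unchanged if decoding fails."""
    candidate = text.strip()
    if not candidate or len(candidate) % 4 != 0:
        return text
    try:
        return base64.b64decode(candidate, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return text


def decode_unicode_escapes(text):
    """Replace \\uXXXX and %uXXXX escapes with the characters they encode."""
    return re.sub(r"(?:\\u|%u)([0-9a-fA-F]{4})",
                  lambda m: chr(int(m.group(1), 16)), text)


# Dates are tried first so that their digits are not replaced one by one.
_DATE_OR_NUMBER = re.compile(
    r"(?<!\w)(\d{4}-\d{1,2}-\d{1,2})(?!\w)|(?<!\w)\d+(?:\.\d+)?(?!\w)")


def simplify(statement):
    """Replace dates with 1-1-1 and standalone decimal numbers with 0x12."""
    return _DATE_OR_NUMBER.sub(
        lambda m: "1-1-1" if m.group(1) else "0x12", statement)


def normalize(payload):
    """Decode a captured payload and simplify the resulting statement."""
    return simplify(decode_unicode_escapes(try_base64(payload)))


encoded = base64.b64encode("id=7 OR d='2024-05-17'".encode("utf-8")).decode("ascii")
print(normalize(encoded))
```

Replacing every literal with a fixed token means that statements differing only in their constants map to the same text, which keeps the vocabulary small for the later feature extraction.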

4. The SQL attack identification method according to claim 3, characterized in that: in step S2, the decoded injection attack statement is segmented by splitting on spaces, so that the SQL injection attack statement is divided into a sequence of strings, and spaces are added before and after the string sequence.

5. The SQL attack identification method according to claim 1, characterized in that: in step S3, the term frequency-inverse document frequency (TF-IDF) index is calculated by the following formula:

TF-IDF = TF × IDF,

where TF-IDF represents the term frequency-inverse document frequency index, TF(i) represents the frequency of word i in the text, IDF(i) represents the importance index of word i, Total(i) represents the number of occurrences of word i in the injection attack statement, Total represents the total number of words in the injection attack statement, T(i) represents the number of statements containing word i, and φ represents the offset.
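The quantities defined in this claim can be combined in a short sketch. The formulas for TF(i) and IDF(i) are not reproduced in the claim text, so the forms used here, TF(i) = Total(i) / Total and IDF(i) = log(N / (T(i) + φ)), are assumptions based on the common definition of TF-IDF; N, the total number of statements, and the function name `tf_idf` are likewise introduced only for this illustration.

```python
import math
from collections import Counter


def tf_idf(statements, phi=1.0):
    """Return one {word: TF-IDF weight} mapping per tokenized statement.

    Assumed forms: TF(i) = Total(i) / Total, IDF(i) = log(N / (T(i) + phi)).
    """
    n_statements = len(statements)
    # T(i): number of statements that contain word i.
    statement_freq = Counter(word for s in statements for word in set(s))
    weights = []
    for statement in statements:
        counts = Counter(statement)   # Total(i) for every word of the statement
        total = len(statement)        # Total
        weights.append({
            word: (count / total)
            * math.log(n_statements / (statement_freq[word] + phi))
            for word, count in counts.items()
        })
    return weights


statements = [["select", "1"], ["union", "select"], ["or", "1"]]
weights = tf_idf(statements)
print(weights[1])
```

Under these assumed forms, a word that occurs in many statements receives a small or even negative weight, while a word concentrated in few statements receives a larger one.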

6. The SQL attack identification method according to claim 5, characterized in that: the SuperTerm_Vector word vector algorithm represents each word in the injection attack statement as a vector and maps the input text data into a multidimensional space, including the following process:

Building a corpus: public text data is collected as the corpus, and the text data in the corpus is preprocessed, including word segmentation, punctuation removal, conversion to lowercase, and stop word removal, to obtain clean text data;

Building a vocabulary from the preprocessed corpus, the vocabulary containing all the unique words that appear in the corpus;

Constructing training samples based on the vocabulary, each training sample consisting of a central word and its surrounding context words;

Defining the model structure: the Skip-gram model is selected to predict the surrounding context words from the central word, the training samples are input into the defined Skip-gram model structure for training, and during training the negative log-likelihood loss function is used to adjust the model parameters so as to maximize the probability of predicting the context words;

Obtaining word vectors: the text data set segmented in step S2 is input into the trained Skip-gram model, and each word in the text data set is mapped into the word vector space and represented as a fixed-length vector. According to the principle that words in the injection attack statement that are semantically similar are also close to each other in the vector space, the distance between word vectors is calculated as follows:

The injection attack statement contains n words w₁, w₂, w₃, ..., wₙ. The list of word vectors of all words in the attack statement under the SuperTerm_Vector algorithm is denoted υ₁, υ₂, υ₃, ..., υₙ. The cosine similarity of two word vectors υ_μ and υ_v is calculated using the following formula:

$$\cos(\upsilon_\mu, \upsilon_v) = \frac{\upsilon_\mu \cdot \upsilon_v}{\|\upsilon_\mu\| \, \|\upsilon_v\|}$$

where υ_μ · υ_v represents the dot product of the vectors, and ||υ_μ|| and ||υ_v|| represent the norms of the word vectors υ_μ and υ_v, respectively. The similarity between different words in the injection attack statement is calculated based on the cosine similarity, and the distance between different words in the word vector space is obtained accordingly.
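Two steps of this claim lend themselves to a short illustration: the construction of (central word, context word) training samples for a Skip-gram model, and the cosine similarity of two word vectors. The function names, the window size, and the example tokens are assumptions made for the sketch; the training of the Skip-gram model itself is not shown.

```python
import math


def skipgram_pairs(tokens, window=2):
    """Build (central word, context word) pairs within a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


pairs = skipgram_pairs(["union", "select", "1"], window=1)
print(pairs)
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```

Vectors that point in the same direction have a cosine similarity of 1 regardless of their length, and orthogonal vectors have a similarity of 0.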

7. The SQL attack identification method according to claim 6, characterized in that: for each word in the injection attack statement, the corresponding word vector is multiplied by its TF-IDF value to obtain the weighted word vector V′_W of the word, and the average weighted word vector of the injection attack statement is expressed as

$$\frac{1}{n}\sum_{W \in D} V'_W$$

where n represents the number of words contained in the injection attack statement, W represents a word contained in the injection attack statement, and D represents the injection attack statement. The average weighted word vector of the injection attack statement is used as the numerical vector feature to train the LC-CNN model for classification.
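The feature fusion of this claim, in which each word vector is scaled by the TF-IDF weight of its word and the results are averaged over the statement, can be sketched as follows. The function name and the two-dimensional example vectors are made up for illustration; real word vectors would come from the trained Skip-gram model.

```python
def statement_vector(words, word_vectors, tfidf):
    """Average of the TF-IDF-weighted word vectors of one statement."""
    n = len(words)
    if n == 0:
        return []
    dim = len(word_vectors[words[0]])
    total = [0.0] * dim
    for word in words:
        weight = tfidf[word]
        for k, component in enumerate(word_vectors[word]):
            total[k] += weight * component  # weighted word vector V'_W
    return [component / n for component in total]


# Illustrative values only.
word_vectors = {"or": [1.0, 0.0], "1": [0.0, 2.0]}
tfidf = {"or": 0.5, "1": 0.25}
feature = statement_vector(["or", "1"], word_vectors, tfidf)
print(feature)
```

The resulting vector has the same dimension as the word vectors, so statements of different lengths are all mapped to features of a fixed size.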

8. The SQL attack identification method according to claim 1, characterized in that: in step S4, the convolutional layers use the ReLU function as the activation function, the fully connected layer uses the Sigmoid function as the activation function, and the flattening layer uses the Tanh function as the activation function.

9. The SQL attack identification method according to claim 7, characterized in that: the LC-CNN model uses the model.compile() function to define the optimizer, loss function, and evaluation metrics of the model; the optimizer is the Adam optimizer, the loss function is binary cross-entropy, and the evaluation metrics include accuracy, F1 score, and the confusion matrix; and the optimal model parameter setting is determined by identifying the iteration with the highest classification accuracy over multiple training iterations of the classification model.
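The loss function, the F1 score, and the selection of the best iteration named in this claim can be illustrated without a deep learning framework. The following sketch is written in plain Python rather than with model.compile(); the function names and the small example inputs are assumptions, and the list of accuracies is the LC-CNN column of Table 1.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for truth, prob in zip(y_true, y_prob):
        prob = min(max(prob, eps), 1.0 - eps)
        total -= truth * math.log(prob) + (1 - truth) * math.log(1.0 - prob)
    return total / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_iteration(accuracies):
    """1-based index of the first iteration with the highest accuracy."""
    return max(range(len(accuracies)), key=accuracies.__getitem__) + 1


lc_cnn_accuracy = [0.9299, 0.9762, 0.9795, 0.9803, 0.9815, 0.9812, 0.9816, 0.9816]
print(best_iteration(lc_cnn_accuracy))
print(binary_cross_entropy([1, 0], [0.5, 0.5]))
print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

For the LC-CNN accuracies of Table 1, the highest value is first reached in the seventh iteration; ties are resolved in favor of the earlier iteration, which is a choice made in this sketch.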

10. A SQL attack identification system based on deep learning dynamic target range feature fusion, characterized by comprising:

A data collection module, which builds a target range server to simulate real SQL injection attacks and obtain injection attack samples;

A data cleaning module, which decodes and inspects the acquired injection attack sample data packets to obtain a text data set of injection attack statements, and performs word segmentation on the injection attack statements in the text data set;

A feature extraction module, which combines the term frequency-inverse document frequency (TF-IDF) index with the SuperTerm_Vector word vector algorithm to process the segmented text data set, and converts the text data of the injection attack statements in the text data set into corresponding numerical vector features;

An LC-CNN classifier with a built-in LC-CNN model, wherein the LC-CNN model has two convolutional layers, the convolutional layers are connected to a flattening layer, and the flattening layer is connected to a fully connected layer; the classification model is trained on a training data set composed of the numerical vector features of the injection attack statements from the feature extraction module and the numerical vector features of normal SQL query statements, and the trained LC-CNN model is then used to identify the test data and determine whether there is an SQL attack.