CN109151880B

CN109151880B - Mobile application flow identification method based on multilayer classifier

Info

Publication number: CN109151880B
Application number: CN201811326852.4A
Authority: CN
Inventors: 赵双; 陈曙晖; 孙一品; 王飞; 苏金树
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2018-11-08
Filing date: 2018-11-08
Publication date: 2021-06-22
Anticipated expiration: 2038-11-08
Also published as: CN109151880A

Abstract

The invention belongs to the field of network traffic analysis and provides a mobile application traffic identification method based on a multilayer classifier, addressing the problem that existing mobile application traffic identification methods cannot detect and handle background traffic. The technical scheme is as follows: first, features are extracted from the traffic training set to obtain a feature representation of each traffic sample; second, a first-layer classifier is trained to make a preliminary decision on whether a sample under test is target traffic or background traffic; third, a second-layer classifier is trained to identify the target traffic at fine granularity; fourth, a third-layer classifier is trained; and fifth, the trained multilayer classifier is used to identify the mobile application traffic of the sample under test. The invention fully considers the traffic distribution in a real network. Without requiring a complete background traffic data set, it learns the features of the target traffic samples layer by layer, so that the classifier can identify the target traffic while excluding background traffic, which reduces the number of false positives.

Description

Mobile application flow identification method based on multilayer classifier

Technical Field

The invention belongs to the field of network traffic analysis, relates to a network traffic identification method based on machine learning, and particularly relates to a mobile application traffic identification method based on a multilayer classifier.

Background

With the spread of mobile devices and the rapid growth of mobile applications, mobile applications have become the most common way for people to access the Internet. By the first quarter of 2018, the Google application market offered 3.8 million applications for download, with an average of 6,140 new applications added each day. By 2017, 57% of network traffic came from mobile devices. Mobile network traffic has thus overtaken traditional workstation traffic as the major component of network traffic, and research focus has shifted accordingly from traditional workstation traffic identification to mobile network traffic identification.

The goal of mobile network traffic identification is to determine which application generated a given piece of mobile traffic. The technology plays an important role in network management and security, market research, user analysis and similar fields. For example, a service provider can use it to track the distribution of mobile application traffic in the network; a network administrator can find out which applications are popular on a campus network and optimize the allocation of the related network resources to improve user experience; and an advertiser can learn when and where a given application is most popular with users and plan advertisement placement accordingly.

Although mobile network traffic identification is similar to the traditional desktop traffic identification process, the specificity of mobile traffic presents a huge challenge to traditional traffic identification technology:

1) Most mobile application traffic is carried over HTTP/HTTPS, so port-based traffic identification can only label it as Web traffic. Traffic on other ports mostly uses random port numbers, for which port-based identification fails completely.

2) To protect user privacy, most mobile traffic is transmitted over encrypted protocols, which reduces the effectiveness of traffic identification based on Deep Packet Inspection (DPI).

3) Mobile applications make heavy use of third-party libraries, so different applications generate similar traffic that is difficult to distinguish using DPI or IP addresses.

4) Mobile applications commonly use CDNs (Content Distribution Networks). As a result, a single server IP address may serve several applications at once, which reduces the effectiveness of DNS (Domain Name System) based traffic identification. In addition, some applications do not use DNS to obtain server addresses, which further limits the applicability of DNS-based techniques.

5) The number of mobile applications is huge, they are updated quickly, and new applications emerge constantly, so identification techniques must be updated continuously. DPI, for example, requires its payload signature library to be updated continuously.

For these reasons, conventional traffic identification methods can no longer process mobile traffic effectively. In recent years, machine-learning-based traffic identification has shown good classification performance on traditional desktop network traffic, so it has also been applied to mobile application traffic identification.

Wang et al. (Wang et al., I know what you did on your smartphone: Inferring app usage over encrypted data traffic, IEEE Conference on Communications and Network Security, 2015, 433-441) manually collected five minutes of traffic from each of 13 applications running on iOS and trained a random forest classifier. That work uses too few samples, so the effectiveness of the method is difficult to assess. AppScanner (Vincent et al., Robust smartphone app identification via encrypted network traffic analysis, IEEE Transactions on Information Forensics & Security, 2017, 13(1): 63-78) extracts application fingerprints and identifies traffic using a random forest algorithm. Its data set consists of traffic generated by 110 applications on two different Android devices. However, that work uses the "burst" (a group of packets within a period of time whose inter-arrival times are below a certain threshold) as the basic unit of traffic identification, so the method is suitable only for simple networks. Wang et al. (Wang et al., End-to-end encrypted traffic classification with one-dimensional convolution neural networks, IEEE International Conference on Intelligence and Security Informatics, 2017, 43-48) identified traffic using a one-dimensional convolutional neural network model and reached an accuracy of 86.6% for fine-grained classification. Deep Packet (Deep packet: a novel approach for encrypted traffic classification using deep learning, arXiv, 2017) classifies mobile application traffic using a one-dimensional convolutional neural network and a stacked autoencoder. Giuseppe et al. (Giuseppe et al., Mobile encrypted traffic classification using deep learning, 2018) compared four neural-network-based identification methods and reported that the classifier proposed by Wang et al. (2017, cited above) achieves the best mobile application traffic identification performance.

In summary, although the mobile application traffic identification methods above all report excellent results, none of them considers the influence of unknown background traffic on classifier performance. The classifiers are tested only in a closed environment, that is, all traffic in the test set comes from applications that appear in the training set. In a real network, besides the target application traffic, there are thousands of unknown applications and a constant stream of emerging applications, and the background traffic generated by these non-target applications poses a great challenge to the classifier. Because the test environments of the above methods ignore this problem, the methods cannot be deployed in a real network environment.

Disclosure of Invention

To address the problem that existing machine-learning-based mobile application traffic identification methods cannot detect and handle background traffic, the invention provides a mobile application traffic identification method based on a multilayer classifier. It learns the features of target traffic samples layer by layer, so that the classifier can exclude background traffic while identifying target traffic, which reduces the number of false positives.

The technical scheme is as follows:

First, features are extracted from the traffic training set to obtain a feature representation of each traffic sample. Each traffic sample is one flow.

Second, a first-layer classifier is trained to make a preliminary decision on whether a sample under test is target traffic or background traffic. Target traffic is labelled as the Target class and background traffic as the Other class.

Third, fuzzy flows are extracted, the training set of the second-layer classifier is constructed, and the second-layer classifier is trained to identify target traffic at fine granularity. Fuzzy flows are similar traffic that several applications generate alike, such as third-party library or advertisement traffic. Let the i-th target application be Appi. The number of target applications is N, where N is a natural number.

Fourth, background traffic samples are re-extracted, the training set of the third-layer classifier is constructed, and the third-layer classifier is trained.

Fifth, the trained multilayer classifier is used to identify the mobile application traffic of the sample under test, as follows. The first-layer classifier identifies the flow sample as the Target class or the Other class, and samples identified as Target pass to the second-layer classifier for further detection. The second-layer classifier then identifies each Target sample at fine granularity as a specific target application or as a fuzzy flow. If a sample is identified as a target application Appi, it passes to the third layer, where the relevant classifiers continue the identification. When the third-layer classifiers give a consistent result, the final identification result is output; otherwise the classifier declines to decide.

As a further improvement of the technical solution of the present invention, the first step of extracting the features of the traffic training set includes: the raw traffic is grouped into flows by the five-tuple (source IP, destination IP, source port, destination port, protocol). If a flow contains five or fewer packets with non-zero payload, the 29 traffic features are extracted from the whole flow. If a flow contains more than five packets with non-zero payload, the 29 traffic features are extracted using only the first five packets with non-zero payload, and all packets after the fifth packet with non-zero payload are discarded.
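The grouping and truncation rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Packet` record, its field names and the function names are hypothetical, packets are assumed to be already parsed and time-ordered, and both directions of a connection are assumed to share one flow key. The text does not say whether empty packets that precede the fifth non-empty packet are retained; this sketch keeps them.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):  # hypothetical parsed-packet record
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: str
    payload: bytes


def flow_key(p):
    # Assumption: sort the two endpoints so both directions share one key.
    a, b = (p.src_ip, p.src_port), (p.dst_ip, p.dst_port)
    return (min(a, b), max(a, b), p.proto)


def truncate_flow(packets, limit=5):
    """Keep the whole flow if it has at most `limit` non-empty packets;
    otherwise drop everything after the `limit`-th non-empty packet."""
    if sum(1 for p in packets if p.payload) <= limit:
        return list(packets)
    kept, seen = [], 0
    for p in packets:
        kept.append(p)
        seen += bool(p.payload)
        if seen == limit:
            break
    return kept


def group_flows(packets):
    """Group time-ordered packets into flows and truncate each flow."""
    flows = defaultdict(list)
    for p in packets:
        flows[flow_key(p)].append(p)
    return {key: truncate_flow(pkts) for key, pkts in flows.items()}
```

Calling `group_flows` on a packet list returns one truncated packet list per flow, which can then be passed to a feature extractor.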

As a further improvement of the technical solution of the present invention, the 29 traffic features consist of the destination port, the first 16 payload bytes of the flow, and 12 statistical features. The 12 statistical features are: the maximum and minimum packet size from client to server; the maximum, minimum, mean and variance of the packet size from server to client; the payload sizes of the first 3 packets with non-zero payload from client to server; the payload sizes of the 1st and 3rd packets with non-zero payload from server to client; and the minimum packet size of the flow.
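As a sketch of how the 29-dimensional feature vector (1 port + 16 payload bytes + 12 statistics) might be assembled, consider the following. The `FlowPacket` record and function names are hypothetical. Several details are assumptions that the text does not specify: missing values are filled with 0, the first 16 payload bytes are taken from the payloads concatenated in arrival order, and the variance is the population variance.

```python
from statistics import mean, pvariance
from typing import NamedTuple


class FlowPacket(NamedTuple):  # hypothetical per-packet record
    from_client: bool
    size: int        # packet size in bytes
    payload: bytes


def pad(values, n):
    """Truncate or zero-pad a sequence to exactly n items (assumption)."""
    values = list(values)[:n]
    return values + [0] * (n - len(values))


def extract_features(dst_port, packets):
    c2s = [p.size for p in packets if p.from_client]
    s2c = [p.size for p in packets if not p.from_client]
    c2s_payload = [len(p.payload) for p in packets if p.from_client and p.payload]
    s2c_payload = [len(p.payload) for p in packets if not p.from_client and p.payload]
    stream = b"".join(p.payload for p in packets)

    stats = [
        max(c2s, default=0), min(c2s, default=0),          # client -> server sizes
        max(s2c, default=0), min(s2c, default=0),          # server -> client sizes
        mean(s2c) if s2c else 0, pvariance(s2c) if s2c else 0,
        *pad(c2s_payload, 3),                              # first 3 non-empty c->s payloads
        *[pad(s2c_payload, 3)[i] for i in (0, 2)],         # 1st and 3rd non-empty s->c payloads
        min((p.size for p in packets), default=0),         # minimum packet size of the flow
    ]
    return [dst_port] + pad(stream[:16], 16) + stats
```

Under this layout, the first-layer and second-layer classifiers would use index 0 and indices 17 to 28 (port and statistics), while the third layer would use indices 1 to 16 (payload bytes).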

As a further improvement of the technical solution of the present invention, the second step of training the first-layer classifier specifically comprises: labels of the training set samples are divided into a Target class and an Other class, and a random forest binary classifier is trained. During training, the weight of the target flow sample is increased, so that the classifier preferentially learns the characteristics of the target sample.
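A minimal sketch of such a weighted binary random forest, using scikit-learn's `RandomForestClassifier`, is shown below. The tree count (30) and maximum depth (20) follow the experimental setup reported later in this document; the weight value `target_weight=5.0`, the integer class encoding and the function name are assumptions made for illustration only, since the text does not state how much the target weight is increased.

```python
from sklearn.ensemble import RandomForestClassifier

TARGET, OTHER = 1, 0  # hypothetical integer encoding of the two classes


def train_first_layer(features, labels, target_weight=5.0):
    """Train a binary Target/Other random forest that up-weights Target samples.

    features: rows of [port, 12 statistical features]
    labels:   TARGET or OTHER for each row
    """
    clf = RandomForestClassifier(
        n_estimators=30,          # 30 trees, as in the experimental setup
        max_depth=20,             # maximum depth 20, as in the experimental setup
        class_weight={TARGET: target_weight, OTHER: 1.0},  # assumed weight value
        random_state=0,
    )
    clf.fit(features, labels)
    return clf
```

Raising the Target weight biases the forest toward high recall on target traffic at the cost of precision, which later layers are meant to recover.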

As a further improvement of the technical solution of the present invention, the features of the training set samples in the second step are the port and the 12 statistical features.

As a further improvement of the technical solution of the present invention, the third step of training the second-layer classifier specifically includes:

Step 3.1, extract fuzzy flows, as follows: first, the traffic sample set is grouped by the two-tuple (server IP, server port). For each group, if the traffic samples in the group carry multiple application labels and no single application's samples are dominant, the traffic in the group is treated as fuzzy flows. In the present invention, an application is considered dominant when its samples make up more than 90% of the samples in the group.

Step 3.2, construct the training sample set, as follows: the traffic samples of each target application that remain in the original training set after the fuzzy flows are extracted are combined with the fuzzy flow samples extracted in step 3.1 to form the training sample set of the second-layer classifier, which contains N + 1 classes of samples.

Step 3.3, train the second-layer (N + 1)-class random forest classifier.
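The fuzzy flow rule in step 3.1 can be sketched as follows. The function name, the `"Fuzzy"` label string and the input tuple layout are hypothetical. The text does not say what happens to minority samples in a group that does have a dominant application; this sketch leaves their labels unchanged, which is an assumption.

```python
from collections import Counter, defaultdict

DOMINANCE = 0.9   # an application is dominant above 90% of a group's samples
FUZZY = "Fuzzy"   # hypothetical label for the fuzzy flow class


def label_fuzzy_flows(samples):
    """samples: list of (server_ip, server_port, app_label).
    Returns the label list with fuzzy groups relabelled as FUZZY."""
    groups = defaultdict(list)
    for idx, (ip, port, _) in enumerate(samples):
        groups[(ip, port)].append(idx)

    labels = [app for _, _, app in samples]
    for members in groups.values():
        counts = Counter(labels[i] for i in members)
        if len(counts) < 2:
            continue  # a single application: never fuzzy
        top_count = counts.most_common(1)[0][1]
        if top_count / len(members) <= DOMINANCE:
            for i in members:  # no dominant application: whole group is fuzzy
                labels[i] = FUZZY
    return labels
```

The relabelled list then contains at most N + 1 distinct labels (N target applications plus the fuzzy class), matching the training set described in step 3.2.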

As a further improvement of the technical solution of the present invention, the features of the training set samples in step 3.3 are the port and the 12 statistical features.

As a further improvement of the technical solution of the present invention, the fourth step of training the third-layer classifier specifically includes:

Step 4.1, extract Other samples, as follows: the original training set is classified with the first-layer classifier trained in the second step, and the Other samples that are wrongly classified as the Target class are extracted to form a new Other data set.

Step 4.2, construct the training sample set, as follows: the Other data set extracted in step 4.1 is combined with the training set of the third step to form the training sample set of the third-layer classifier, which contains N + 2 classes of samples, namely the N target application classes, the fuzzy flow class and the Other class.

Step 4.3, train the third-layer classifier, as follows: random forest and XGBoost (Chen et al., XGBoost: A Scalable Tree Boosting System, ACM International Conference on Knowledge Discovery and Data Mining, 2016, 785-) models are used. For each model, following the one-vs-one method (Zhou Zhihua, Machine Learning, 2016, 63-66), one binary classifier is trained for every pair of classes, giving (N + 2) × (N + 1)/2 binary classifiers. The final third layer therefore contains (N + 2) × (N + 1) classifiers. The training features are the first 16 payload bytes.
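Steps 4.1 and 4.2 amount to mining the background samples that the first layer gets wrong and adding them to the second-layer training set. A minimal sketch follows; the function name and the `(features, label)` tuple layout are hypothetical, and `first_layer_predict` stands in for the trained first-layer classifier as any callable that returns `"Target"` or `"Other"`.

```python
TARGET, OTHER = "Target", "Other"


def build_third_layer_set(original_set, second_layer_set, first_layer_predict):
    """Combine the second-layer training set with the Other samples that the
    first-layer classifier wrongly labels as Target.

    original_set:        list of (features, label); background rows use OTHER
    second_layer_set:    list of (features, label) with N app labels + fuzzy
    first_layer_predict: callable(features) -> TARGET or OTHER (stand-in)
    """
    hard_other = [
        (features, OTHER)
        for features, label in original_set
        if label == OTHER and first_layer_predict(features) == TARGET
    ]
    return list(second_layer_set) + hard_other
```

Only the background samples that survive the first layer are kept, so the third layer concentrates on the background traffic that most resembles target traffic.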

As a further improvement of the technical solution of the present invention, the fifth step of mobile application traffic identification specifically includes:

Step 5.1, classification by the first-layer classifier, as follows: the port and the 12 statistical features of the traffic sample are extracted and the sample is identified with the first-layer classifier. If it is identified as Target, go to step 5.2. Otherwise, the sample is judged to be background traffic and the process ends.

Step 5.2, classification by the second-layer classifier, as follows: the port and the 12 statistical features of the traffic sample are extracted and the sample is identified with the second-layer classifier. If it is identified as a target application Appi, go to step 5.3. If it is identified as a fuzzy flow, the process ends and no specific application label is given.

Step 5.3, classification by the third-layer classifier, as follows: the first 16 payload bytes of the flow sample are extracted and the sample is identified with the 2 × (N + 1) relevant third-layer classifiers. When the results of all 2 × (N + 1) classifiers agree with step 5.2, the sample is judged to be Appi; otherwise the process ends and no specific application label is given.

As a further improvement of the technical solution of the present invention, the 2 × (N + 1) classifiers in step 5.3 are N + 1 random forest classifiers and N + 1 XGBoost classifiers. The N + 1 binary random forest classifiers comprise the binary classifiers trained on Appi samples and Appj samples (j not equal to i), the binary classifier trained on Appi samples and Other samples, and the binary classifier trained on Appi samples and fuzzy flow samples. The N + 1 binary XGBoost classifiers likewise comprise the binary classifiers trained on Appi samples and Appj samples (j not equal to i), the binary classifier trained on Appi samples and Other samples, and the binary classifier trained on Appi samples and fuzzy flow samples.
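The classifier counts above follow directly from enumerating class pairs, as the sketch below illustrates. The class names and function names are hypothetical. With N + 2 classes there are (N + 2) × (N + 1)/2 pairs per model type, of which exactly N + 1 involve a given Appi.

```python
from itertools import combinations


def class_names(n):
    """N target applications plus the fuzzy flow class and the Other class."""
    return [f"App{i}" for i in range(1, n + 1)] + ["Fuzzy", "Other"]


def all_pairs(n):
    """Every one-vs-one class pair: one binary classifier per pair and model type."""
    return list(combinations(class_names(n), 2))


def relevant_pairs(n, app):
    """Pairs whose binary classifier is consulted when layer two outputs `app`."""
    return [pair for pair in all_pairs(n) if app in pair]


# Example with N = 7 target applications, as in the embodiment's data set 1.
n = 7
pairs_per_model = len(all_pairs(n))             # (7 + 2) * (7 + 1) / 2 = 36
relevant_per_model = len(relevant_pairs(n, "App3"))  # 7 + 1 = 8
print(pairs_per_model * 2, relevant_per_model * 2)   # both model types: 72 and 16
```

For N = 7 this gives 72 binary classifiers in the third layer, of which 16 are consulted for any one candidate application.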

Compared with the prior art, the invention has the beneficial effects that:

Because unknown applications number in the thousands and new applications emerge constantly, a complete background traffic data set cannot be collected. A classifier therefore cannot learn all background traffic patterns and cannot effectively exclude background traffic it has not seen. Without a complete background traffic data set, the classifier designed by the invention gains the ability to exclude non-target samples by learning the target samples better, layer by layer, which mitigates the influence of unknown traffic on classifier performance.

The invention fully considers the traffic distribution in a real network, and the proposed multilayer classifier can effectively detect the large amount of background traffic present in the network, which offers guidance for deploying machine-learning-based mobile application traffic identification in practice.

Drawings

FIG. 1 is a general flow diagram of the present invention;

FIG. 2 is a flow chart of classifier identification of the present invention;

FIG. 3 is a graph of the precision and recall of the second-layer and third-layer classifiers in an embodiment of the present invention;

FIG. 4 is a comparison of classifier precision and recall with and without fuzzy flow detection in an embodiment of the present invention;

FIG. 5 is a comparison of classifier false positive counts with and without fuzzy flow detection in an embodiment of the present invention;

FIG. 6 is a comparison of classifier performance for different decision thresholds in an embodiment of the present invention;

FIG. 7 is a comparison of classifier precision in an embodiment of the present invention;

FIG. 8 is a comparison of classifier recall in an embodiment of the present invention;

FIG. 9 is a comparison of classifier false positive counts in an embodiment of the present invention.

Detailed Description

The following describes embodiments of the present invention in further detail with reference to examples.

As shown in fig. 1, the mobile application traffic identification method based on the multi-layer classifier of the present invention includes the following steps:

In the first step, the features of the traffic training set are extracted, that is, each sample is represented by its features, 29 in total.

In the second step, the first-layer classifier is trained. The training data set samples are divided into the Target and Other classes, and a binary random forest classifier is trained. The result of training is the first-layer classifier in fig. 2.

In the third step, the second-layer classifier is trained. Fuzzy flows are extracted first and the training set of the second-layer classifier is constructed, containing N + 1 classes of samples in total. An (N + 1)-class random forest classifier is then trained to identify target traffic at fine granularity. The result of training is the second-layer classifier in fig. 2.

In the fourth step, the third-layer classifier is trained. Background traffic samples are re-extracted first and the training set of the third-layer classifier is constructed, containing N + 2 classes of samples. The third-layer classifier is then trained, producing (N + 2) × (N + 1)/2 classifiers for each type of model. The result of training is the third-layer base classifier pool in fig. 2.

In the fifth step, the sample under test is identified. As shown in fig. 2, a flow sample under test is first identified by the first-layer classifier as the Target class or the Other class. Samples identified as the Target class, which include both target traffic and background traffic, pass to the second-layer classifier for further detection. The second-layer classifier identifies each Target sample at fine granularity as a specific target application or as a fuzzy flow. If a sample is identified as the target application Appi, the relevant classifiers in the third layer continue the identification. When the third-layer classifiers give a consistent result, the final identification result is output; otherwise the classifier declines to decide.
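The cascade in the fifth step can be sketched as a short function. All names here are hypothetical, and the three layers are passed in as plain callables standing in for trained models: `layer1` and `layer2` take the port and statistical features, and `layer3_for(app)` returns the relevant third-layer binary classifiers, each taking the first 16 payload bytes. Returning `None` for "no specific label" is a design choice of this sketch.

```python
OTHER, FUZZY = "Other", "Fuzzy"


def identify(flow_stats, payload16, layer1, layer2, layer3_for):
    """Run one flow through the three-layer cascade.

    Returns "Other" for background traffic, an application label when all
    relevant third-layer classifiers agree with layer two, and None when no
    specific label is given (fuzzy flow, or third-layer disagreement).
    """
    if layer1(flow_stats) != "Target":
        return OTHER                      # layer 1: background traffic, stop

    app = layer2(flow_stats)
    if app == FUZZY:
        return None                       # layer 2: fuzzy flow, no label

    votes = [clf(payload16) for clf in layer3_for(app)]
    if all(vote == app for vote in votes):
        return app                        # layer 3: unanimous confirmation
    return None                           # layer 3: declined
```

A caller would typically count `None` results separately from `"Other"`, since they represent declined decisions rather than detected background traffic.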

The invention was tested on real network traffic to evaluate its effectiveness.

1) Data set

Mobile application traffic generated by 12 users over the preceding three months was collected locally. The mobile device brands involved include Huawei, Xiaomi and Samsung, 160 mobile applications are covered, and the traffic was generated on 2G, 3G, 4G and wireless networks. The collected traffic was divided into two data sets, whose details are given in table 1.

TABLE 1 Data set details

Data set 1 was used to train and test the proposed three-layer classifier. In it, 7 applications have more than 5000 flow samples each and were selected as the target applications; the remaining 131 applications are treated as non-target applications, and their traffic is the background traffic. Data set 2 was used only to test the three-layer classifier. It contains 22 applications that do not appear in data set 1, with 5569 flow samples in total, so these 22 applications can be regarded as emerging applications. The detailed composition of the two data sets is shown in table 2.

TABLE 2 Composition of data set 1 and data set 2

2) Experimental setup

The classifiers are implemented with the Scikit-learn machine learning library. The three-layer classifier is compared with the current state-of-the-art method, a one-dimensional convolutional neural network classifier (1D-CNN) (Wang et al., End-to-end encrypted traffic classification with one-dimensional convolution neural networks, IEEE International Conference on Intelligence and Security Informatics, 2017, 43-48), and with a reference classifier consisting of a single random forest. The reference classifier is an (N + 1)-class random forest with 30 trees and a maximum depth of 20. For the 1D-CNN, the first 784 payload bytes of each flow are extracted to build a one-dimensional vector as the model input, and an (N + 1)-class 1D-CNN classifier is trained with the Keras library. The parameters of the one-dimensional convolutional neural network are the same as in the original work. Fuzzy flow detection is not applied to the reference classifier or to the 1D-CNN classifier. In the three-layer classifier, the random forest models of the first two layers each contain 30 trees with a maximum depth of 20. Each random forest model in the third layer contains 20 trees with a maximum depth of 20, and each XGBoost model contains 10 trees with a maximum depth of 5.
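Building the 784-byte input vector for the 1D-CNN baseline might look like the sketch below. The function name is hypothetical, and zero-padding flows whose payload is shorter than 784 bytes is an assumption, since the text only says that the first 784 payload bytes are extracted.

```python
INPUT_LEN = 784  # number of payload bytes fed to the 1D-CNN baseline


def payload_vector(payloads, length=INPUT_LEN):
    """Concatenate a flow's packet payloads and return exactly `length`
    byte values (0-255), zero-padded when the flow is shorter (assumption)."""
    data = b"".join(payloads)[:length]
    return list(data) + [0] * (length - len(data))
```

Any further preprocessing, such as scaling byte values, would follow the original 1D-CNN work and is not shown here.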

Five evaluation indexes of true positive number TP (true positive), false positive number FP (false positive), false negative number FN (false negative), Precision (Precision) and Recall (Recall) are used for evaluating the performance of the proposed classifier.
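For reference, precision and recall are computed from the three counts in the standard way: precision is TP / (TP + FP) and recall is TP / (TP + FN). The sketch below uses made-up counts purely for illustration; they are not results from this document.

```python
def precision(tp, fp):
    """Fraction of samples labelled as an application that truly belong to it."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Fraction of an application's samples that the classifier recovers."""
    return tp / (tp + fn) if tp + fn else 0.0


# Illustrative counts only (hypothetical, not taken from the experiments).
tp, fp, fn = 90, 10, 30
print(precision(tp, fp), recall(tp, fn))  # 0.9 0.75
```

A classifier that declines to decide on uncertain flows lowers FP (raising precision) while raising FN (lowering recall), which is the trade-off examined in the experiments.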

3) Per-layer performance test

The classification performance of each layer of the proposed multilayer classifier was tested on data set 1. First, data set 1 was randomly divided into a training set and a test set in a 7:3 ratio. Training and testing were then repeated 10 times and the average results are reported. For the first-layer classifier, precision and recall were 12.21% and 99.40%, respectively. This matches expectations: the first layer excludes only a small amount of background traffic, so it identifies target application traffic with low precision but high recall. The precision and recall of the second-layer and third-layer classifiers for each application are shown in fig. 3. Fig. 3 shows that the classifier achieves very high precision and excludes background traffic very effectively. At the same time, however, the recall of the second-layer classifier for the last three applications is very low, because the second-layer classifier assigned 57.29%, 47.43% and 33.97% of the test samples of these three applications to the fuzzy flow class. Careful inspection of the second-layer training set showed that the last three applications are closely related to other non-target applications in data set 1. For example, QQ is a popular instant messaging application that integrates a great many functions such as news push, mail management and music playback. These functions, however, also exist as independent applications, namely Tencent News, QQ Mail and QQ Music. Traffic generated by QQ therefore has a high probability of sharing similar or identical features with the traffic of other background applications. To reduce false positives, the classifier preferentially identifies such traffic as fuzzy flows without assigning a specific class, which lowers the recall for QQ traffic. Taobao and Baidu are in a similar situation.
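The evaluation protocol described above (a random 7:3 split, repeated 10 times and averaged) can be sketched as follows. The function names and the `train_and_score` callback are hypothetical; the callback stands in for training the classifier on the training part and returning a metric on the test part. Drawing a fresh random split on each run is an assumption, since the text does not say whether the split is redrawn.

```python
import random
from statistics import mean


def split_7_3(samples, rng):
    """Shuffle a copy of the samples and cut it 70% / 30%."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]


def repeated_evaluation(samples, train_and_score, runs=10, seed=0):
    """Average a metric over `runs` independent random 7:3 splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        train, test = split_7_3(samples, rng)
        scores.append(train_and_score(train, test))
    return mean(scores)
```

Fixing the seed makes the sequence of splits reproducible across classifiers, so that different methods are compared on the same partitions.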

4) Fuzzy flow detection test

The experiments above show that treating fuzzy flows as a separate class has a large effect on the recall of the second-layer classifier. This experiment therefore compares the performance of the multilayer classifier with and without fuzzy flow detection. Data set 1 was used and the experimental procedure was the same as above. The resulting comparison is shown in fig. 4 and fig. 5: fig. 4 compares classifier precision and recall with and without fuzzy flow detection, and fig. 5 compares the number of false positives with and without fuzzy flow detection.

As fig. 4 and fig. 5 show, without fuzzy flow detection the recall of every application increases, and the recall of the last three applications increases greatly. However, the number of false positives also increases and the precision drops slightly. When the tolerance for false positives is low, the classifier with fuzzy flow detection can be chosen. When high recall is the priority, fuzzy flow detection can be removed.

5) Third tier classifier threshold testing

The third-layer classifier uses multiple binary classifiers to learn the features of the target application traffic. If a flow is labelled only when all relevant binary classifiers give a consistent result, the classifier achieves high precision and excludes false positives well, but correct decisions of the second layer are also more likely to be rejected, which lowers recall. Different decision thresholds of the third-layer classifier are therefore tested and compared here. The experimental setup was the same as in the first two experiments, and the comparison is shown in fig. 6. In the figure, the RF value is the minimum number of random forest models, and the XG value the minimum number of XGBoost classifiers, whose results must agree with the second-layer classifier before the label of a flow is accepted.

As fig. 6 shows, as the decision conditions become stricter, the precision of the classifier gradually increases from 80% to 99% and the recall gradually decreases from 66% to 53%. Increasing either the XG value or the RF value raises precision and lowers recall. When recall matters more, the XG or RF value can be reduced appropriately; when precision matters more, the XG or RF value can be increased.
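The RF and XG thresholds can be read as minimum agreement counts, as in the sketch below. The function and parameter names are hypothetical, and requiring both conditions at once is an assumption about how the two thresholds combine.

```python
def accept_label(app, rf_votes, xg_votes, rf_min, xg_min):
    """Accept the second-layer label `app` only if at least `rf_min` random
    forest classifiers and at least `xg_min` XGBoost classifiers agree.

    rf_votes / xg_votes: labels output by the relevant third-layer classifiers
    """
    rf_agree = sum(1 for vote in rf_votes if vote == app)
    xg_agree = sum(1 for vote in xg_votes if vote == app)
    return rf_agree >= rf_min and xg_agree >= xg_min
```

With N + 1 relevant classifiers per model type, setting both thresholds to N + 1 reproduces the strictest, unanimous rule, and lowering either threshold trades precision for recall.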

6) Classifier comparison

This experiment compares the proposed three-layer classifier with the reference classifier and the one-dimensional convolutional neural network classifier (1D-CNN). The proposed three-layer classifier uses fuzzy flow detection, and the decision condition of the third layer is set to the strictest case, that is, the results of all relevant classifiers must agree with the result of the second-layer classifier.

(1) Data set 1 testing

Data set 1 was randomly divided into a training set and a test set in a 7:3 ratio, and the average of the test results over 10 rounds of training and testing is reported. The three classifiers compared are the reference classifier consisting of a single random forest model, the one-dimensional convolutional neural network model (1D-CNN) proposed by Wang et al. (Wang et al., End-to-end encrypted traffic classification with one-dimensional convolution neural networks, IEEE International Conference on Intelligence and Security Informatics, 2017, 43-48), and the three-layer classifier of the present invention. The comparison results are shown in the first four rows of table 3, and the precision, recall and number of false positives for each application are shown in fig. 7, fig. 8 and fig. 9, respectively.

TABLE 3 comparison of the Classification Performance of the four classifiers

The results show that the proposed classifier has the highest precision, reaching nearly 99%, and generates far fewer false positives than the other two classifiers. Compared with the base classifier, the number of false positives generated by the three-layer classifier of the invention is reduced by 94%, which indicates that the third-layer classifier has excellent background flow detection capability. However, the average recall rate of the proposed multilayer classifier is much lower than that of the base classifier, because the recall rate for the last three applications is extremely low. When the identification target is application coverage, a low recall rate does not affect the identification result, but false positives must be as few as possible, so the method is well suited to such scenarios. When the identification target is flow coverage, the recall rate of the method still needs to be improved, but in scenarios with a certain requirement on identification precision the method retains a clear advantage.
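
The three quantities compared above (precision, recall rate and number of false positives per application) can be computed as in the following Python sketch. The function name, the toy labels and the use of `None` for a flow on which the classifier refuses to judge are illustrative assumptions, not data from the experiments.

```python
def per_app_metrics(y_true, y_pred, app):
    """Return (precision, recall, false_positives) for one target application.

    y_true: ground-truth labels, with "Other" for background flows.
    y_pred: predicted labels, with None when the classifier refuses to judge
            (hypothetical convention for this sketch).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == app and t == app)
    fp = sum(1 for t, p in pairs if p == app and t != app)
    fn = sum(1 for t, p in pairs if p != app and t == app)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, fp


# Toy data: three QQ flows followed by two background flows.
y_true = ["QQ", "QQ", "QQ", "Other", "Other"]

# A permissive classifier labels everything as QQ.
permissive = ["QQ", "QQ", "QQ", "QQ", "QQ"]
# A strict classifier accepts one flow and rejects the rest.
strict = ["QQ", None, None, None, None]

permissive_result = per_app_metrics(y_true, permissive, "QQ")  # (0.6, 1.0, 2)
strict_result = per_app_metrics(y_true, strict, "QQ")
```

In the toy data the strict classifier still identifies the application at least once while producing no false positives, which is the situation the text describes for application coverage.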

In addition, the last three applications, which have low recall rates, have fewer training samples than the first four. The classifier is therefore retrained, with the samples resampled during training by the SMOTEENN method (Batista et al., A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD Explorations Newsletter, 2004, 20-29) to equalize the number of training samples of each class, in order to examine the impact of the number of samples on the classification performance of the model. The retrained classifier is named "three-level classifier + SMOTEENN", and its recognition results are shown in the last row of table 3 and FIGS. 6-8. It can be seen that the number of samples is not the primary cause of the low recall rate: although equalizing the samples improves the recall rate to some extent, its negative impact on identification precision is greater.

(2) Data set 2 testing

The classifier is trained on data set 1 and tested on data set 2 to further verify its recognition performance. The test results are shown in table 4.

The results in table 4 are similar to the test results on data set 1: the three-layer classifier proposed by the present invention has the best background flow detection capability of the three classifiers. On data set 2, the three-layer classifier produced a total of 152 false positives, of which 45 flows came from emerging applications, and excluded about 99.2% of the total unknown traffic. In contrast, the random forest produced 1478 false positives, with 403 flows from emerging applications, and the 1D-CNN produced 3963 false positives, with 1348 flows from emerging applications. The remaining 107 false positives of the three-layer classifier were then analyzed in detail. Of the 31 flows misclassified as Tencent Video, 3 came from QQ Music and 28 from Tencent News. QQ Music, Tencent News and Tencent Video are all applications of Tencent, and they are very likely to access the same resources because of functional requirements. Similarly, for QQ there are 46 false positives from Tencent Map, Tencent Weibo and the like. For the 1 false positive attributed to Sogou Pinyin, the IP address of the flow was checked and compared with other flows having the same IP address in the training set, and all labels of the related flows in the training set were found to be Sogou Pinyin. This false positive is therefore very likely a mislabeled sample.

TABLE 4 data set 2 test results

Therefore, by learning the characteristics of the target application traffic layer by layer in a refined manner, the method enhances the classifier's ability to detect background traffic, so that the classifier can detect traffic it has never learned. Experimental results show that the method has very high identification precision, can effectively detect background traffic generated by unknown and emerging applications, and has a clear advantage in identification scenarios that require application coverage.

Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims

1. The mobile application traffic identification method based on the multilayer classifier is characterized by comprising the following steps of:

firstly, extracting the features of a traffic training set to obtain the feature representation of traffic samples, and recording each traffic sample as a flow; the specific process is as follows: grouping the original traffic according to the five-tuple <source IP, destination IP, source port, destination port, protocol> to form flows; if the number of packets with non-zero payload contained in a flow is less than or equal to five, extracting the corresponding 29 flow features from the whole flow; if the number of packets with non-zero payload contained in a flow is greater than five, only the first five packets with non-zero payload are used to extract the corresponding 29 flow features, and the packets after the fifth packet with non-zero payload are discarded;

secondly, training a first-layer classifier, and preliminarily detecting a sample to be detected as target flow or background flow; recording the Target flow as Target class and the background flow as Other class; dividing labels of training set samples into Target classes and Other classes, and training a random forest binary classifier; during training, the weight of the target flow sample is increased, so that the classifier preferentially learns the characteristics of the target sample;

thirdly, extracting fuzzy flows, constructing a training set of a second-layer classifier, then training the second-layer classifier, and carrying out fine-grained identification on the target traffic; fuzzy flows refer to similar flows produced by multiple applications simultaneously; the i-th target application is denoted App_i; the number of target applications is N, and N is a natural number; the specific method comprises the following steps:

step 3.1, extracting fuzzy flows, wherein the specific method comprises the following steps: firstly, grouping the flow sample set according to the two-tuple <server IP, server port>; for each group, if the flow samples in the group have multiple application labels and the samples of no single application are dominant, the flows in the group are fuzzy flows;

step 3.2, constructing a training sample set, wherein the specific method comprises the following steps: extracting the flow samples of each target application that remain in the original training set after the fuzzy flows are extracted, and combining them with the fuzzy flow samples extracted in step 3.1 to form the training sample set of the second-layer classifier, wherein the training sample set comprises N+1 classes of samples;

step 3.3, training the second-layer (N+1)-class random forest classifier;

fourthly, re-extracting the background flow sample, constructing a training set of a third-layer classifier, and then training the third-layer classifier; the specific method comprises the following steps:

step 4.1, extracting Other samples, wherein the specific method comprises the following steps: classifying the original training set by using a first-layer classifier trained in the second step, extracting Other samples which are wrongly classified into Target classes, and forming a new Other data set;

step 4.2, constructing a training sample set, wherein the specific method comprises the following steps: combining the Other data set extracted in step 4.1 with the training set of the third step to form the training sample set of the third-layer classifier, wherein the training sample set comprises N+2 classes of samples, namely the N classes of target application samples, the fuzzy flow class and the Other class;

step 4.3, training the third-layer classifier, wherein the specific method comprises the following steps: selecting a random forest model and an XGBoost model to train the third-layer classifier; for each model, (N+2) × (N+1)/2 binary classifiers are trained based on the one-versus-one method;

fifthly, performing mobile application traffic identification on the sample to be detected by using the trained multilayer classifier, wherein the method comprises the following steps: firstly, identifying the flow sample to be detected as the Target class or the Other class by using the first-layer classifier, and passing flow samples identified as the Target class to the second-layer classifier for further detection; then, the second-layer classifier identifies each Target-class sample in a fine-grained manner as a certain target application or as a fuzzy flow; if a sample is identified as a target application, it enters the third-layer classifier, where the relevant classifiers continue to identify it; when the third-layer classifiers give a consistent recognition result, the final recognition result is given, and otherwise the classifier refuses to judge.
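
The fuzzy flow extraction of step 3.1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the claim does not quantify "dominant", so the `dominance` threshold of 0.9 is hypothetical, as are the function name, the dictionary keys and the example addresses.

```python
from collections import Counter, defaultdict


def split_fuzzy_flows(flows, dominance=0.9):
    """Split labelled flows into (clear, fuzzy) lists.

    flows: list of dicts with keys "server_ip", "server_port", "label"
           (hypothetical record format).
    dominance: assumed share above which one application is considered
               dominant within a <server IP, server port> group.
    """
    groups = defaultdict(list)
    for flow in flows:
        groups[(flow["server_ip"], flow["server_port"])].append(flow)

    clear, fuzzy = [], []
    for members in groups.values():
        counts = Counter(flow["label"] for flow in members)
        top_share = max(counts.values()) / len(members)
        if len(counts) > 1 and top_share < dominance:
            # Several applications share this endpoint and none dominates.
            fuzzy.extend(members)
        else:
            # Single label, or one application dominates: keep the labels.
            clear.extend(members)
    return clear, fuzzy


flows = [
    {"server_ip": "203.0.113.10", "server_port": 443, "label": "QQ"},
    {"server_ip": "203.0.113.10", "server_port": 443, "label": "Tencent Video"},
    {"server_ip": "203.0.113.10", "server_port": 443, "label": "QQ"},
    {"server_ip": "203.0.113.20", "server_port": 80, "label": "QQ"},
    {"server_ip": "203.0.113.20", "server_port": 80, "label": "QQ"},
]
clear, fuzzy = split_fuzzy_flows(flows)
```

In this sketch the flows returned in `fuzzy` would be relabelled as the single fuzzy flow class of step 3.2, giving N+1 classes together with the remaining target application samples.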

2. The mobile application traffic identification method based on the multi-layer classifier as claimed in claim 1, wherein the fifth step mobile application traffic identification is as follows:

step 5.1, the first-layer classifier classification is specifically as follows: extracting ports and 12 statistical characteristics of flow samples, and identifying by using a first-layer classifier; if the Target is identified, entering step 5.2; otherwise, judging the flow as background flow, and ending;

step 5.2, classifying by a second-layer classifier, wherein the specific method comprises the following steps: extracting ports and 12 statistical characteristics of the flow samples, and identifying by using a second-layer classifier; if a certain target application is identified, entering a step 5.3; if the fuzzy flow is identified, ending;

step 5.3, the third-layer classifier classification specifically comprises the following steps: extracting the first 16 bytes of the flow sample, and identifying by using the 2 x (N+1) relevant third-layer classifiers; when the recognition results of the 2 x (N+1) classifiers are all consistent with step 5.2, the flow is determined to be App_i; otherwise, ending.
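
Steps 5.1 to 5.3 form a cascade, which might be sketched in Python as below. The classifiers are passed in as plain callables, which is an assumption made to keep the sketch self-contained; the function name, the `None` return value for "ending" or refusing to judge, and the stub classifiers in the usage example are likewise hypothetical.

```python
def identify_flow(flow, layer1, layer2, layer3_for):
    """Run one flow through the three-layer cascade.

    layer1(flow)    -> "Target" or "Other"
    layer2(flow)    -> a target application name or "Fuzzy"
    layer3_for(app) -> list of relevant binary classifiers for that
                       application, each a callable returning a label
    Returns the application name, "Other" for background traffic, or
    None when no decision is made (assumed encoding of "ending").
    """
    # Step 5.1: coarse Target/Other detection.
    if layer1(flow) != "Target":
        return "Other"
    # Step 5.2: fine-grained identification; fuzzy flows end here.
    label = layer2(flow)
    if label == "Fuzzy":
        return None
    # Step 5.3: every relevant classifier must agree with step 5.2.
    if all(clf(flow) == label for clf in layer3_for(label)):
        return label
    return None


# Stub classifiers, for illustration only.
def layer1(flow):
    return "Target" if flow["port"] == 443 else "Other"

def layer2(flow):
    return "QQ"

def layer3_for(app):
    return [
        lambda flow: "QQ",
        lambda flow: "QQ" if flow["payload16"].startswith(b"\x16") else "Other",
    ]

accepted = identify_flow({"port": 443, "payload16": b"\x16\x03"}, layer1, layer2, layer3_for)
rejected = identify_flow({"port": 443, "payload16": b"\x17\x03"}, layer1, layer2, layer3_for)
background = identify_flow({"port": 80, "payload16": b""}, layer1, layer2, layer3_for)
```

In a real system the callables would wrap trained models that read the port and 12 statistical features (layers one and two) or the first 16 payload bytes (layer three).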

3. The multi-layer classifier based mobile application traffic identification method of claim 1, wherein the 29 flow features comprise the destination port, the first 16 payload bytes of the flow, and 12 statistical features; the 12 statistical features are the maximum value and the minimum value of the packet size from the client to the server, the maximum value, the minimum value, the average value and the variance of the packet size from the server to the client, the payload sizes of the first 3 packets with non-zero payload from the client to the server, the payload sizes of the 1st and 3rd packets with non-zero payload from the server to the client, and the minimum packet size of the flow.

4. The multi-layer classifier based mobile application traffic identification method of claim 1, wherein the training set samples are characterized by the port and 12 statistical features; the 12 statistical features are the maximum value and the minimum value of the packet size from the client to the server, the maximum value, the minimum value, the average value and the variance of the packet size from the server to the client, the payload sizes of the first 3 packets with non-zero payload from the client to the server, the payload sizes of the 1st and 3rd packets with non-zero payload from the server to the client, and the minimum packet size of the flow.
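
The 12 statistical features, and the 29-element feature vector they belong to (1 port + 16 payload bytes + 12 statistics), could be computed as in the following Python sketch. The packet tuple format, the function names, the use of population variance, zero-filling for missing packets or short payloads, and the example packet sizes are all assumptions made for illustration; the input is assumed to be already truncated as described in claim 1.

```python
from statistics import mean, pvariance


def statistical_features(packets):
    """packets: list of (direction, packet_size, payload_size) tuples,
    with direction "c2s" (client to server) or "s2c" (server to client)."""
    c2s = [size for d, size, _ in packets if d == "c2s"]
    s2c = [size for d, size, _ in packets if d == "s2c"]
    c2s_pay = [p for d, _, p in packets if d == "c2s" and p > 0]
    s2c_pay = [p for d, _, p in packets if d == "s2c" and p > 0]

    def nth(values, i):
        # Assumption: a missing packet contributes 0.
        return values[i] if i < len(values) else 0

    return [
        max(c2s, default=0), min(c2s, default=0),        # client -> server size
        max(s2c, default=0), min(s2c, default=0),        # server -> client size
        mean(s2c) if s2c else 0,
        pvariance(s2c) if s2c else 0,                    # population variance (assumed)
        nth(c2s_pay, 0), nth(c2s_pay, 1), nth(c2s_pay, 2),
        nth(s2c_pay, 0), nth(s2c_pay, 2),                # 1st and 3rd s2c payloads
        min((size for _, size, _ in packets), default=0),
    ]


def flow_features(dst_port, payload, packets):
    """Destination port + first 16 payload bytes (zero-padded) + 12 statistics."""
    first16 = list(payload[:16].ljust(16, b"\x00"))
    return [dst_port] + first16 + statistical_features(packets)


packets = [
    ("c2s", 60, 0), ("s2c", 60, 0),
    ("c2s", 257, 217), ("s2c", 1400, 1360),
    ("s2c", 140, 100), ("c2s", 90, 50),
]
stats = statistical_features(packets)
features = flow_features(443, b"\x16\x03\x01", packets)
```

Layers one and two would use the port together with `stats`, while layer three would use only the 16 payload-byte entries of `features`.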

5. The multi-layer classifier based mobile application traffic identification method according to claim 2, wherein the 2 x (N+1) classifiers in step 5.3 are N+1 binary random forest classifiers and N+1 binary XGBoost classifiers; the N+1 binary random forest classifiers comprise the binary classifiers each trained on a training set formed by App_i samples and App_j samples, where j is not equal to i, a binary classifier trained on a training set formed by App_i samples and Other samples, and a binary classifier trained on a training set formed by App_i samples and fuzzy flow samples; the N+1 binary XGBoost classifiers comprise the binary classifiers each trained on a training set formed by App_i samples and App_j samples, where j is not equal to i, a binary classifier trained on a training set formed by App_i samples and Other samples, and a binary classifier trained on a training set formed by App_i samples and fuzzy flow samples.
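
The classifier counts in the claims can be checked with a short Python sketch: with N target applications plus the fuzzy flow class and the Other class, one-versus-one training yields (N+2) × (N+1)/2 class pairs per model, of which N+1 involve a given App_i. The function names and the three example application names are illustrative assumptions.

```python
from itertools import combinations


def one_vs_one_pairs(apps):
    """All class pairs trained per model: N apps + "Fuzzy" + "Other"."""
    classes = list(apps) + ["Fuzzy", "Other"]
    return list(combinations(classes, 2))


def relevant_pairs(apps, app):
    """Pairs whose binary classifier is relevant to one target application."""
    return [pair for pair in one_vs_one_pairs(apps) if app in pair]


apps = ["QQ", "Tencent Video", "Sogou Pinyin"]   # N = 3
n = len(apps)

all_pairs = one_vs_one_pairs(apps)
qq_pairs = relevant_pairs(apps, "QQ")

total_per_model = len(all_pairs)        # (N+2) * (N+1) / 2 = 10
relevant_per_model = len(qq_pairs)      # N + 1 = 4
relevant_both_models = 2 * relevant_per_model  # random forest + XGBoost = 8
```

For N = 3 this gives 10 binary classifiers per model and 8 relevant classifiers consulted for a flow that the second layer labels as QQ.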