
1 Introduction

Modern malware imitates benign HTTP traffic to evade detection. To detect unseen malicious traffic, a linguistic-based detection method for proxy logs has been proposed [1]. This method uses Paragraph Vector to extract features automatically, and it achieved good accuracy on balanced datasets in the previous work. To generate a discriminative feature representation with Paragraph Vector, a balanced corpus is required. If a corpus contains far more words of one class than of the other, an imbalance occurs: the larger class occupies most of the corpus, and such a corpus cannot yield a discriminative feature representation. This method expects the corpus to be extracted from actual benign proxy logs and malicious pcap files. Actual proxy logs contain a huge number of words, and representing them requires long-term collection of much traffic. In contrast, malicious pcap files contain only a limited number of words. Thus, benign traffic is dominant and drowns out the malicious feature representation, a consequence related to the base-rate fallacy [2]. Therefore, this method might not perform accurately in a practical environment.

This paper demonstrates the effect of this imbalance with actual proxy logs. To mitigate the imbalance, this paper proposes a way to generate a balanced corpus from actual proxy logs. Our method extracts important words from proxy logs based on TFIDF (Term Frequency Inverse Document Frequency) scores. TFIDF is a numerical statistic that is intended to reflect word importance. To the best of our knowledge, the sole prior example of using TFIDF in the field of network security is extracting malware features from host logs [3]. Our method can detect unseen malicious traffic from network devices that monitor a wide range of traffic.

The main contributions of this paper are as follows: (1) we demonstrate that dominant benign traffic drowns out the malicious feature representation; (2) we propose a method to mitigate the imbalance with TFIDF scores; (3) we verify that our method can detect unseen malicious traffic in a practical environment.

The rest of the paper is organized as follows. The next section describes the Natural Language Processing (NLP) techniques we use. Section 3 introduces the linguistic-based detection method and describes the imbalance. Section 4 proposes a method to mitigate the imbalance. Section 5 demonstrates that dominant benign traffic drowns out the malicious feature representation, and shows the effectiveness of our method. Section 6 discusses the results and Sect. 7 reviews related work.

2 Natural Language Processing (NLP) Techniques

2.1 Paragraph Vector (Doc2vec)

Paragraph Vector (Doc2vec) [4] is an extension of Word2vec that constructs embeddings from entire documents. Word2vec is a shallow two-layer neural network trained to reconstruct the linguistic contexts of words. Word2vec has two models to produce a distributed representation of words. The Continuous-Bag-of-Words (CBoW) model predicts the current word from a window of surrounding context words. The skip-gram model reverses this: it uses the current word to predict the surrounding window of context words. These models make it possible to calculate the semantic similarity between two words and to infer semantically similar words. The same idea has been extended to entire documents. In the same manner, Doc2vec has two models to produce Paragraph Vector, a distributed representation of entire documents. Distributed-Memory (DM) is the extension of CBoW; the only change is adding a document ID to the window of surrounding context words. Distributed-Bag-of-Words (DBoW) is the extension of skip-gram, with the current word replaced by the current document ID. Doc2vec makes it possible to calculate the semantic similarity between two documents and to infer semantically similar documents.
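The following is a minimal gensim sketch of the two Doc2vec variants, assuming the gensim 4.x API; the toy corpus and the parameter values are illustrative and are not the settings of the detection method in [1].

```python
# Minimal Doc2vec sketch with gensim (4.x API); the toy corpus and the
# parameter values are illustrative only, not those of the detection method.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=["get", "example", "com", "index", "html"], tags=["d0"]),
    TaggedDocument(words=["post", "example", "com", "login", "php"], tags=["d1"]),
]

# dm=1 selects the Distributed-Memory model (the CBoW extension),
# dm=0 selects the Distributed-Bag-of-Words model (the skip-gram extension).
dm_model = Doc2Vec(docs, vector_size=50, window=5, min_count=1, dm=1, epochs=40)
dbow_model = Doc2Vec(docs, vector_size=50, window=5, min_count=1, dm=0, epochs=40)

# Infer a vector for an unseen paragraph and find the most similar training document.
vec = dm_model.infer_vector(["get", "example", "com", "index", "html"])
print(dm_model.dv.most_similar([vec], topn=1))
```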

2.2 TFIDF (Term Frequency Inverse Document Frequency)

TFIDF is a numerical statistic that is intended to reflect word importance to a document in a collection or corpus, and is one of the most popular term-weighting schemes. A TFIDF score increases proportionally to the number of times a word appears in the document, and is offset by the frequency of the word in the corpus. TFIDF is the product of two statistics: term frequency and inverse document frequency. Term frequency is the number of times that a term occurs in a document. Inverse document frequency is a measure of how much information the term provides, i.e., whether the term is common or rare across all documents.

TFIDF of term i in document j in a corpus of D documents is calculated as follows.

$$TFIDF_{i,j} = frequency_{i,j} \times \log_{2}{\frac{D}{document\_frequency_{i}}}$$

A high weight in TFIDF is reached by a high term frequency in the given document and a low document frequency of the term in the whole collection of documents. Therefore, the weights tend to exclude common terms. The ratio inside the log function is always greater than or equal to 1. Hence, the TFIDF score is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the TFIDF score closer to 0.
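As a worked example of the formula, the following plain-Python sketch scores a toy three-document corpus; the documents are hypothetical stand-ins for tokenized proxy-log paragraphs.

```python
# Plain-Python illustration of the TFIDF formula above on a toy corpus.
import math
from collections import Counter

corpus = [
    ["get", "example", "com", "index", "html"],
    ["get", "example", "com", "login", "php"],
    ["post", "evil", "example", "gate", "php"],
]

D = len(corpus)
# document_frequency_i: the number of documents that contain term i
doc_freq = Counter()
for doc in corpus:
    doc_freq.update(set(doc))

def tfidf(term, doc):
    # frequency_{i,j} * log2(D / document_frequency_i)
    return doc.count(term) * math.log2(D / doc_freq[term])

print(tfidf("example", corpus[0]))  # 0.0: "example" appears in every document
print(tfidf("index", corpus[0]))    # log2(3) ~ 1.58: "index" appears in only one
```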

Our method utilizes TFIDF scores to extract important unique words from proxy logs.

3 Linguistic-Based Detection Method

3.1 Extracting Words

Based on a linguistic approach, a malicious traffic detection method for proxy logs has been proposed [1]. The key idea of this method is treating proxy logs as a natural language. Proxy logs contain the date and time at which the transaction completed, the request line from the client (including the method, the URL and the protocol version), the user agent, the HTTP status code returned to the client and the size of the object returned to the client.

This method divides a single log line into the HTTP status code, the request line from the client, the size of the object returned to the client and the user agent by single spaces. Then, the request line is divided into the method, the URL and the protocol version by single spaces. Furthermore, this method divides the URL into words by the delimiters "dot" (.), "slash" (/), "question mark" (?), "equal" (=) and "and" (&). This method treats each string separated by a single space or one of the delimiters as a word. A single paragraph consists of the words extracted from 10 log lines. Thus, this method derives paragraphs from proxy logs.
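A minimal sketch of this word extraction is given below; the sample log line is hypothetical, and the real field layout depends on the proxy's log format.

```python
# Sketch of the word extraction above. The sample log line is hypothetical;
# real proxy logs also carry a timestamp and other fields in a vendor-specific order.
import re

line = "200 GET http://example.com/search?q=test&lang=en HTTP/1.1 512 Mozilla/5.0"

# Words are the strings separated by single spaces; fields are further split
# on the delimiters "." "/" "?" "=" "&".
DELIMITERS = r"[./?=&]"
words = []
for field in line.split(" "):
    words.extend(token for token in re.split(DELIMITERS, field) if token)

print(words)
# e.g. ['200', 'GET', 'http:', 'example', 'com', 'search', 'q', 'test', 'lang', 'en', ...]

# The words of ten consecutive log lines form one paragraph.
```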

3.2 Previous Method

The previous method [1] uses two types of machine learning techniques. One is Doc2vec, an unsupervised learning model for feature extraction. This method constructs a Doc2vec model from the paragraphs and trains the model. The paragraphs are derived from benign and malicious proxy logs; note that these logs have to be labeled as benign or malicious. Then, the paragraphs are converted into feature vectors. In this process, the feature representation is extracted automatically by the model, which is why this method utilizes a neural network model. The other is a supervised learning model to classify the feature vectors. This method trains the classifiers with the feature vectors and labels.

Subsequently, this method derives paragraphs from unknown proxy logs. The paragraphs are converted into feature vectors with the trained Doc2vec model. Finally, the trained classifiers predict the label from the feature vectors.
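A rough sketch of this pipeline with gensim and scikit-learn follows; the toy paragraphs, labels and parameter values are assumptions for illustration and are not the settings used in [1].

```python
# Rough sketch of the previous pipeline: unsupervised Doc2vec feature
# extraction followed by a supervised classifier. The toy paragraphs, labels
# and parameter values are assumptions, not the settings used in [1].
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.svm import SVC

paragraphs = [
    ["get", "example", "com", "index", "html"],   # from benign proxy logs
    ["post", "gate", "php", "payload", "bin"],    # from malicious proxy logs
]
labels = [0, 1]  # 0 = benign, 1 = malicious

tagged = [TaggedDocument(words=p, tags=[i]) for i, p in enumerate(paragraphs)]

# (1) Unsupervised feature extraction with Doc2vec.
d2v = Doc2Vec(tagged, vector_size=100, min_count=1, epochs=30)
features = [d2v.infer_vector(p) for p in paragraphs]

# (2) Supervised classification of the feature vectors.
clf = SVC()
clf.fit(features, labels)

# (3) Prediction for paragraphs derived from unknown proxy logs.
unknown = [["get", "example", "com", "login", "php"]]
print(clf.predict([d2v.infer_vector(p) for p in unknown]))
```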

3.3 Imbalance

This method performed well on balanced datasets, which were generated from benign and malicious pcap files at the same ratio [1]. This method continually requires affordable benign and malicious corpora, because new websites are created constantly, as is new malware. The benign corpus is extracted from proxy logs in each organization. The malicious corpus is extracted from malicious pcap files downloaded from websites such as Malware-Traffic-Analysis.net. In reality, proxy logs in a large organization contain a huge number of words, whereas these malicious pcap files contain only a limited number of words. If we generate a dataset from both sets of words at the same ratio, the generated corpus cannot represent the benign features adequately, because the corpus contains only as many benign words as there are words in the malicious corpus. If we merely generate a dataset from all the words of both, an imbalance occurs: benign words occupy most of the corpus, and such a corpus cannot generate a discriminative feature representation. Thus, benign traffic is dominant and drowns out the malicious feature representation. Therefore, this method might not perform accurately in a practical environment; Sect. 5 describes the details. To perform accurately in a practical environment, this method needs a balanced corpus.

4 Proposed Method

4.1 Notion

To mitigate the imbalance, this paper pursues a way to generate a balanced corpus from actual traffic. In general, benign proxy logs contain a huge number of words, whereas malicious pcap files contain only a limited number of words. To generate a balanced corpus, the numbers of unique words of both classes have to be adjusted to the same ratio. To adjust the number of unique words, our method utilizes TFIDF scores. TFIDF is one of the most popular term-weighting schemes, and reflects word importance to a document in a corpus. Our method extracts important words based on the TFIDF scores and removes the ephemeral words. This provides two benefits: generating a balanced corpus and improving classification accuracy. Our method extracts important words from benign proxy logs, thereby reducing the number of unique words; we can then obtain a balanced corpus that contains appropriate unique words from both classes. Furthermore, our method also extracts important words from malicious pcap files, which might improve classification accuracy slightly.

4.2 Overview

This paper proposes a practical linguistic-based detection method which includes how to generate a balanced corpus. Figure 1 shows an overview of the proposed method.

Fig. 1. An overview of the proposed method.

Updated processes are surrounded by a broken line in the figure. First, our method converts malicious and benign proxy logs or pcap files into words (1). Pcap files are converted into pseudo proxy logs. Both kinds of logs are separated into words by single spaces and the delimiters. A single paragraph consists of the words extracted from 10 log lines; the number of log lines was determined empirically. Next, our method constructs TFIDF models from the malicious and benign words (2ab) and extracts the top-N important words from each (3ab). The two models have to be constructed separately, so the extracted top-N unique words differ, because these unique words have to represent both kinds of traffic equally. Note that both models use the same N: to construct a balanced corpus, the numbers of unique words of both classes have to be equal. That is why our method extracts the top-N important words rather than the words above a threshold. The default value of N is the number of unique malicious words; N is a parameter for adjusting the number of unique words. Our method extracts important words based on the TFIDF scores and removes ephemeral words. Thus, our method mingles both sets of words and obtains a balanced corpus, as sketched below. Then, our method trains a Doc2vec model with the balanced corpus and converts both sets of words into feature vectors with the trained model (4). Finally, our method trains classifiers with the feature vectors and labels (5). The classifiers are Support Vector Machine (SVM), Random Forests (RF) and Multi-Layer Perceptron (MLP).

These are supervised learning models with associated learning algorithms that analyze data for classification and regression. Given a set of training examples, each labeled as belonging to one of two categories, the training algorithms build a model that assigns new examples to one category or the other.
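Returning to steps (2ab) and (3ab) above, the following sketch scores each class's words with the TFIDF formula of Sect. 2.2 and keeps the top-N to build a balanced corpus. How per-document scores are aggregated into one score per word (here, the maximum) is our assumption, and the toy paragraphs are hypothetical.

```python
# Sketch of steps (2ab)-(3ab): score each class's words with the TFIDF formula
# of Sect. 2.2 and keep the top-N. Aggregating per-document scores by taking
# the maximum per word is our assumption.
import math
from collections import Counter, defaultdict

def top_n_words(documents, n):
    D = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    score = defaultdict(float)
    for doc in documents:
        for term, freq in Counter(doc).items():
            weight = freq * math.log2(D / doc_freq[term])
            score[term] = max(score[term], weight)
    # The n highest-scoring words; ties are broken arbitrarily.
    return {w for w, _ in sorted(score.items(), key=lambda kv: -kv[1])[:n]}

# Toy paragraphs; N defaults to the number of unique malicious words (Sect. 4.2).
malicious_docs = [["post", "gate", "php", "payload", "bin"],
                  ["post", "gate", "php", "counter", "bin"]]
benign_docs = [["get", "example", "com", "index", "html"],
               ["get", "example", "com", "login", "php"]]
N = len({w for d in malicious_docs for w in d})
malicious_vocab = top_n_words(malicious_docs, N)
benign_vocab = top_n_words(benign_docs, N)

# The balanced corpus keeps only the selected words from both classes.
balanced = [[w for w in d if w in malicious_vocab | benign_vocab]
            for d in benign_docs + malicious_docs]
```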

Subsequently, our method derives paragraphs from unknown proxy logs in the same way and keeps only the top-N important words that were selected in the training process. The modified paragraphs are converted into feature vectors with the trained Doc2vec model. Then, the trained classifiers predict the label from the feature vectors. The predicted label is either malicious or benign.
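Continuing the sketches above (reusing the hypothetical `d2v`, `clf`, `benign_vocab` and `malicious_vocab`), the prediction step could look roughly as follows.

```python
# Prediction sketch: an unknown paragraph keeps only the words selected during
# training, then reuses the trained Doc2vec model and classifier from the
# sketches above (d2v, clf, benign_vocab, malicious_vocab are hypothetical).
selected = benign_vocab | malicious_vocab

def predict(paragraph_words):
    filtered = [w for w in paragraph_words if w in selected]
    return clf.predict([d2v.infer_vector(filtered)])[0]  # 0 = benign, 1 = malicious

print(predict(["get", "example", "com", "gate", "php"]))
```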

Table 1. Details of the MTA dataset.

4.3 Implementation

We implemented our method with gensim-1.2.1, scikit-learn-0.19.1 and chainer-2.0.12. Gensim is a Python library for unsupervised semantic modelling from plain text, and provides a Doc2vec implementation. Scikit-learn is a machine-learning library for Python that provides data-mining tools, including SVM and RF. Chainer is a flexible Python framework for neural networks, with which we build the MLP. We use the same models and set the same parameters as the previous method [1] for comparison.
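A sketch of the classifier stack is shown below; the implementation builds the MLP with chainer, but scikit-learn's MLPClassifier is substituted here for brevity, and the hyperparameter values are illustrative rather than those of [1].

```python
# Classifier stack sketch. The implementation builds the MLP with chainer;
# scikit-learn's MLPClassifier is substituted here for brevity, and the
# hyperparameter values are illustrative, not the settings of [1].
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

classifiers = {
    "SVM": SVC(kernel="rbf", C=1.0),
    "RF": RandomForestClassifier(n_estimators=100),
    "MLP": MLPClassifier(hidden_layer_sizes=(100,), max_iter=500),
}

# Each classifier is trained on the same Doc2vec feature vectors and labels:
# for name, clf in classifiers.items():
#     clf.fit(features, labels)
```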

5 Experiment

5.1 Dataset

In this experiment, we use pcap files captured from Exploit Kits (EKs) between 2014 and 2017 as malicious traffic. An EK is a software kit designed to run on web servers with the purpose of identifying vulnerabilities in client computers; it seeks and exploits those vulnerabilities to upload and execute malicious code on the client computer. We chose EKs that communicate via a standard protocol and attempt to imitate benign HTTP communication, and downloaded the packet traces from the website Malware-Traffic-Analysis.net. Table 1 shows the details.

This dataset (MTA) contains pcap files extracted from the traffic of recent EKs. Since our method aims to detect malicious traffic from proxy logs, we converted these pcap files into pseudo proxy logs: we extracted the HTTP traffic from the pcap files and concatenated the requests and responses.

This paper uses actual proxy logs collected between 2016 and 2017 as benign traffic. These proxy logs were collected at a campus network, which belongs to a class B network and consists of more than 5,000 computers. Due to computing resource constraints, we extracted 1 GB of logs from each of 2016 and 2017. We assume that these proxy logs represent benign traffic.

We mingle the malicious logs and benign logs into datasets. We split the datasets into training data and test data for timeline analysis. This paper uses three metrics: Precision (P), Recall (R) and F-measure (F).
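For reference, with malicious traffic treated as the positive class and TP, FP and FN denoting true positives, false positives and false negatives, these metrics are defined in the standard way:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F = \frac{2 \cdot P \cdot R}{P + R}$$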

5.2 Demonstrating the Imbalance

To compare the previous method [1] with the proposed method, we conducted a timeline analysis. We used the 2014 MTA data and the 2016 proxy logs as the training data, and the 2015 MTA data and the 2017 proxy logs as the test data. Table 2 shows the numbers of unique words and vectors. The previous method does not extract important words: the number of original unique words in the benign traffic is 608,805, and the previous method constructed a Doc2vec model from these original words. In contrast, the proposed method extracted 8,200 important words from the benign traffic and constructed a Doc2vec model from the reduced vocabulary. The numbers of vectors are the same for both methods, and the numbers of malicious and benign vectors differ greatly, which makes the experimental environment fairer and more practical.

Table 2. The numbers of unique words and vectors.

Next, Table 3 shows the performance of the timeline analysis. In spite of the reduced vocabulary, the performance on benign traffic is almost perfect, so we focus on the malicious traffic. The proposed method largely maintains its performance in the timeline analysis; the best F-measure reaches 0.90. With the previous method, the decline is too large to detect malicious traffic, and the performance is quite poor. This is because the dominant benign traffic drowns out the malicious feature representation; recall that the number of unique words in the benign traffic was far larger than that in the malicious traffic. Thus, the previous method does not perform accurately in a practical environment, whereas the proposed method can detect unseen malicious traffic in actual proxy logs.

Table 3. Performance of the timeline analysis.

5.3 Measuring Long-Term Performance

To measure the long-term performance of the proposed method, we conducted a timeline analysis with all the datasets. In this experiment, we used the 2014 to 2016 MTA data as training data and the 2015 to 2017 MTA data as test data, respectively. In every case, we used the 2016 proxy logs as training data and the 2017 proxy logs as test data. Moreover, we optimized the numbers of unique words (N) for both classes. Table 4 shows the performance of the long-term analysis.

Table 4. Performance of the long-term analysis.

For the same reason, we focus on the performance on the malicious traffic. When the 2014 data were used for training and the later years for testing, the best F-measure reached 0.94; performance gradually decreases every year, and three years later the F-measure still remains at 0.73. When the 2015 data were used for training and the following years for testing, the best F-measure reached 0.86, and the next year the F-measure remained at 0.67. When the 2016 data were used for training and the 2017 data for testing, the best F-measure reached 0.83.

6 Discussion

As a result, our method using proxy logs could detect the latest EKs almost precisely in a practical environment, where the previous method [1] showed quite poor performance. This is because dominant benign traffic drowns out the malicious feature representation. Our method could extract important words from the benign traffic and construct a balanced corpus, which is why it achieved good accuracy in the practical environment. Furthermore, our method optimized the numbers of unique words for both classes, extracting important words from both benign and malicious traffic; this adjustment improved classification accuracy slightly. Therefore, extracting important words from proxy logs with TFIDF scores is efficient and can mitigate the imbalance.

Our method requires several tens of minutes to construct a Doc2vec model and train a classifier in a practical environment; in this experiment, it took roughly from 30 min to an hour. The larger the pcap files or proxy logs used to construct the models, the longer the required time. However, the models can be constructed in advance, so our method performs within a practical time. With these pre-trained models, our method can classify unknown logs in a few seconds. Thus, our method can analyze network traffic or proxy logs in real time.

In this paper, we used malicious pcap files and benign proxy logs, which can be obtained from websites such as MTA and from our own network system. These files might contain privacy-sensitive information such as personal information, email addresses or clients' IP addresses. However, to detect unseen malicious traffic, our method requires only the pre-trained models, which include neither payloads nor logs. Furthermore, our method does not require clients' IP addresses and does not even need to distinguish the clients' sources. This means our method accepts most vantage points for monitoring traffic.

7 Related Work

Invernizzi et al. [5] built a network graph with IP addresses, domain names, FQDNs, URLs, paths and file names. Their method focuses on the correlation among nodes to detect malware distribution, and all the parameters are obtained from proxy logs. This method, however, has to cover many ranges of IP addresses, so it performs well only in large-scale networks such as ISPs; in addition, it requires the types of the downloaded files. Nelms et al. [6] proposed a trace-back system that can go back to the source from the URL transfer graph. This method uses hop counts, domain age and common features of the domain names to detect malicious URLs. Bartos et al. [7] categorized proxy logs into flows, extracted various features from the flows, and proposed how to learn feature vectors to classify malicious URLs. This method can decide the optimum feature vectors automatically; however, it demands devising basic features for learning. Mimura et al. [8] categorized proxy logs by FQDN to extract feature vectors, and proposed a RAT (Remote Access Trojan or Remote Administration Tool) detection method using machine learning techniques. This method uses the characteristic that RATs continue to access the same path regularly; however, it works only for C&C traffic. Shibahara et al. [9] focused on sequences of URLs that include artifacts of malicious redirections, and built a detection system that uses Convolutional Neural Networks. This method uses a honey client to collect URL sequences and their labels, and works for DbD attacks. Our method uses only proxy logs, requires no additional information, and does not require devising feature vectors at all. In addition, our method performs at any scale and can detect not only DbD attacks but also C&C traffic.

8 Conclusion

In this paper, we focused on the imbalance in which benign traffic is dominant and drowns out the malicious feature representation. This paper demonstrated that the previous method [1] is not effective on actual proxy logs because of this imbalance. To mitigate the imbalance, this paper proposed a way to generate a balanced corpus from actual proxy logs: our method extracts important words from proxy logs based on TFIDF scores. We conducted a timeline analysis with actual proxy logs, and the experimental results showed that our method can detect unseen malicious traffic in a practical environment. The best F-measure achieved 0.94 in the timeline analysis. Our method does not rely on specific attack techniques and does not demand devising feature vectors. Furthermore, our method is adaptable, fast and has few constraints in practical use.

In this paper, we used datasets generated by mixing malicious pcap files and benign logs. Applying our method to other proxy logs is future work.