
1 Introduction

Modern malware imitates benign HTTP traffic to evade detection. To detect unseen malicious traffic, a linguistic-based detection method for proxy logs has been proposed [1]. This method uses Paragraph Vector to extract features automatically, and it achieved good accuracy on balanced datasets in the previous work. To generate a discriminative feature representation with Paragraph Vector, a balanced corpus is required. If a corpus contains far more words of one class than of the other, an imbalance occurs: the larger class occupies most of the corpus, and such a corpus cannot yield a discriminative feature representation. This method expects the corpus to be extracted from actual benign proxy logs and malicious pcap files. Actual proxy logs contain a huge number of words, and representing them requires long-term collection of much traffic. In contrast, malicious pcap files contain only a limited number of words. Thus, benign traffic is dominant and drowns out the malicious feature representation, a consequence related to the base-rate fallacy [2]. Therefore, this method might not perform accurately in a practical environment.

This paper demonstrates the effect of this imbalance with actual proxy logs. To mitigate the imbalance, this paper proposes a way to generate a balanced corpus from actual proxy logs. Our method extracts important words from proxy logs based on TFIDF (Term Frequency Inverse Document Frequency) scores. TFIDF is a numerical statistic that is intended to reflect word importance. To the best of our knowledge, the sole prior example of using TFIDF in the field of network security is extracting malware features from host logs [3]. Our method can detect unseen malicious traffic from network devices that monitor a wide range of traffic.

The main contributions of this paper are as follows: (1) we demonstrate that dominant benign traffic drowns out the malicious feature representation; (2) we propose a method to mitigate the imbalance with TFIDF scores; (3) we verify that our method can detect unseen malicious traffic in a practical environment.

The rest of the paper is organized as follows. The next section describes the Natural Language Processing (NLP) techniques we use. Section 3 introduces the linguistic-based detection method and describes the imbalance. Section 4 proposes a method to mitigate the imbalance. Section 5 demonstrates that dominant benign traffic drowns out the malicious feature representation, and shows the effectiveness of our method. Section 6 discusses the results and Sect. 7 reviews related work.

2 Natural Language Processing (NLP) Techniques

2.1 Paragraph Vector (Doc2vec)

Paragraph Vector (Doc2vec) [4] is an extension of Word2vec that constructs embeddings from entire documents. Word2vec is a shallow two-layer neural network trained to reconstruct the linguistic contexts of words. Word2vec has two models to produce a distributed representation of words. The Continuous-Bag-of-Words (CBoW) model predicts the current word from a window of surrounding context words. The skip-gram model reverses this: it uses the current word to predict the surrounding window of context words. These models make it possible to calculate the semantic similarity between two words and to infer semantically similar words. The same idea has been extended to entire documents. In the same manner, Doc2vec has two models to produce Paragraph Vector, a distributed representation of entire documents. Distributed-Memory (DM) is the extension of CBoW; the only change is adding a document ID to the window of surrounding context words. Distributed-Bag-of-Words (DBoW) is the extension of skip-gram, with the current word replaced by the current document ID. Doc2vec makes it possible to calculate the semantic similarity between two documents and to infer semantically similar documents.
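The following is a minimal gensim sketch of the two Doc2vec variants, assuming the gensim 4.x API; the toy corpus and the parameter values are illustrative and are not the settings of the detection method in [1].

```python
# Minimal Doc2vec sketch with gensim (4.x API); the toy corpus and the
# parameter values are illustrative only, not those of the detection method.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=["get", "example", "com", "index", "html"], tags=["d0"]),
    TaggedDocument(words=["post", "example", "com", "login", "php"], tags=["d1"]),
]

# dm=1 selects the Distributed-Memory model (the CBoW extension),
# dm=0 selects the Distributed-Bag-of-Words model (the skip-gram extension).
dm_model = Doc2Vec(docs, vector_size=50, window=5, min_count=1, dm=1, epochs=40)
dbow_model = Doc2Vec(docs, vector_size=50, window=5, min_count=1, dm=0, epochs=40)

# Infer a vector for an unseen paragraph and find the most similar training document.
vec = dm_model.infer_vector(["get", "example", "com", "index", "html"])
print(dm_model.dv.most_similar([vec], topn=1))
```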

2.2 TFIDF (Term Frequency Inverse Document Frequency)

TFIDF is a numerical statistic that is intended to reflect word importance to a document in a collection or corpus, and is one of the most popular term-weighting schemes. A TFIDF score increases proportionally to the number of times a word appears in the document, and is offset by the frequency of the word in the corpus. TFIDF is the product of two statistics: term frequency and inverse document frequency. Term frequency is the number of times that a term occurs in a document. Inverse document frequency is a measure of how much information the term provides, i.e., whether the term is common or rare across all documents.

TFIDF of term i in document j in a corpus of D documents is calculated as follows.

$$TFIDF_{i,j} = frequency_{i,j} \times \log_{2}{\frac{D}{document\_frequency_{i}}}$$

A high weight in TFIDF is reached by a high term frequency in the given document and a low document frequency of the term in the whole collection of documents. Therefore, the weights tend to exclude common terms. The ratio inside the log function is always greater than or equal to 1. Hence, the TFIDF score is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the TFIDF score closer to 0.
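As a worked example of the formula, the following plain-Python sketch scores a toy three-document corpus; the documents are hypothetical stand-ins for tokenized proxy-log paragraphs.

```python
# Plain-Python illustration of the TFIDF formula above on a toy corpus.
import math
from collections import Counter

corpus = [
    ["get", "example", "com", "index", "html"],
    ["get", "example", "com", "login", "php"],
    ["post", "evil", "example", "gate", "php"],
]

D = len(corpus)
# document_frequency_i: the number of documents that contain term i
doc_freq = Counter()
for doc in corpus:
    doc_freq.update(set(doc))

def tfidf(term, doc):
    # frequency_{i,j} * log2(D / document_frequency_i)
    return doc.count(term) * math.log2(D / doc_freq[term])

print(tfidf("example", corpus[0]))  # 0.0: "example" appears in every document
print(tfidf("index", corpus[0]))    # log2(3) ~ 1.58: "index" appears in only one
```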

Our method utilizes TFIDF scores to extract important unique words from proxy logs.

3 Linguistic-Based Detection Method

3.1 Extracting Words

Based on a linguistic approach, a malicious traffic detection method for proxy logs has been proposed [1]. The key idea of this method is treating proxy logs as a natural language. Proxy logs contain the date and time at which the transaction completed, the request line from the client (including the method, the URL and the protocol version), the user agent, the HTTP status code returned to the client and the size of the object returned to the client.

This method divides a single log line into the HTTP status code, the request line from the client, the size of the object returned to the client and the user agent by single spaces. Then, the request line is divided into the method, the URL and the protocol version by single spaces. Furthermore, this method divides the URL into words by the delimiters "dot" (.), "slash" (/), "question mark" (?), "equal" (=) and "and" (&). This method treats each string separated by a single space or one of the delimiters as a word. A single paragraph consists of the words extracted from 10 log lines. Thus, this method derives paragraphs from proxy logs.
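A minimal sketch of this word extraction is given below; the sample log line is hypothetical, and the real field layout depends on the proxy's log format.

```python
# Sketch of the word extraction above. The sample log line is hypothetical;
# real proxy logs also carry a timestamp and other fields in a vendor-specific order.
import re

line = "200 GET http://example.com/search?q=test&lang=en HTTP/1.1 512 Mozilla/5.0"

# Words are the strings separated by single spaces; fields are further split
# on the delimiters "." "/" "?" "=" "&".
DELIMITERS = r"[./?=&]"
words = []
for field in line.split(" "):
    words.extend(token for token in re.split(DELIMITERS, field) if token)

print(words)
# e.g. ['200', 'GET', 'http:', 'example', 'com', 'search', 'q', 'test', 'lang', 'en', ...]

# The words of ten consecutive log lines form one paragraph.
```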

3.2 Previous Method

The previous method [1] uses two types of machine learning techniques. One is Doc2vec, an unsupervised learning model for feature extraction. This method constructs a Doc2vec model from the paragraphs and trains the model. The paragraphs are derived from benign and malicious proxy logs; note that these logs have to be labeled as benign or malicious. Then, the paragraphs are converted into feature vectors. In this process, the feature representation is extracted automatically by the model, which is why this method utilizes a neural network model. The other is a supervised learning model to classify the feature vectors. This method trains the classifiers with the feature vectors and labels.

Subsequently, this method derives paragraphs from unknown proxy logs. The paragraphs are converted into feature vectors with the trained Doc2vec model. Finally, the trained classifiers predict the label from the feature vectors.
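A rough sketch of this pipeline with gensim and scikit-learn follows; the toy paragraphs, labels and parameter values are assumptions for illustration and are not the settings used in [1].

```python
# Rough sketch of the previous pipeline: unsupervised Doc2vec feature
# extraction followed by a supervised classifier. The toy paragraphs, labels
# and parameter values are assumptions, not the settings used in [1].
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.svm import SVC

paragraphs = [
    ["get", "example", "com", "index", "html"],   # from benign proxy logs
    ["post", "gate", "php", "payload", "bin"],    # from malicious proxy logs
]
labels = [0, 1]  # 0 = benign, 1 = malicious

tagged = [TaggedDocument(words=p, tags=[i]) for i, p in enumerate(paragraphs)]

# (1) Unsupervised feature extraction with Doc2vec.
d2v = Doc2Vec(tagged, vector_size=100, min_count=1, epochs=30)
features = [d2v.infer_vector(p) for p in paragraphs]

# (2) Supervised classification of the feature vectors.
clf = SVC()
clf.fit(features, labels)

# (3) Prediction for paragraphs derived from unknown proxy logs.
unknown = [["get", "example", "com", "login", "php"]]
print(clf.predict([d2v.infer_vector(p) for p in unknown]))
```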

3.3 Imbalance

This method performed well on balanced datasets, which were generated from benign and malicious pcap files at the same ratio [1]. This method continually requires affordable benign and malicious corpora, because new websites are created constantly, as is new malware. The benign corpus is extracted from proxy logs in each organization. The malicious corpus is extracted from malicious pcap files downloaded from websites such as Malware-Traffic-Analysis.net. In reality, proxy logs in a large organization contain a huge number of words, whereas these malicious pcap files contain only a limited number of words. If we generate a dataset from both sets of words at the same ratio, the generated corpus cannot represent the benign features adequately, because the corpus contains only as many benign words as there are words in the malicious corpus. If we merely generate a dataset from all the words of both, an imbalance occurs: benign words occupy most of the corpus, and such a corpus cannot generate a discriminative feature representation. Thus, benign traffic is dominant and drowns out the malicious feature representation. Therefore, this method might not perform accurately in a practical environment; Sect. 5 describes the details. To perform accurately in a practical environment, this method needs a balanced corpus.

4 Proposed Method

4.1 Notion

To mitigate the imbalance, this paper pursues a way to generate a balanced corpus from actual traffic. In general, benign proxy logs contain a huge number of words, whereas malicious pcap files contain only a limited number of words. To generate a balanced corpus, the numbers of unique words of both classes have to be adjusted to the same ratio. To adjust the number of unique words, our method utilizes TFIDF scores. TFIDF is one of the most popular term-weighting schemes, and reflects word importance to a document in a corpus. Our method extracts important words based on the TFIDF scores and removes the ephemeral words. This provides two benefits: generating a balanced corpus and improving classification accuracy. Our method extracts important words from benign proxy logs, thereby reducing the number of unique words; we can then obtain a balanced corpus that contains appropriate unique words from both classes. Furthermore, our method also extracts important words from malicious pcap files, which might improve classification accuracy slightly.

4.2 Overview

This paper proposes a practical linguistic-based detection method which includes how to generate a balanced corpus. Figure 1 shows an overview of the proposed method.

Fig. 1. An overview of the proposed method.

Updated processes are surrounded by a broken line in the figure. First, our method converts malicious and benign proxy logs or pcap files into words (1). Pcap files are converted into pseudo proxy logs. Both kinds of logs are separated into words by single spaces and the delimiters. A single paragraph consists of the words extracted from 10 log lines; the number of log lines was determined empirically. Next, our method constructs TFIDF models from the malicious and benign words (2ab) and extracts the top-N important words from each (3ab). The two models have to be constructed separately, so the extracted top-N unique words differ, because these unique words have to represent both kinds of traffic equally. Note that both models use the same N: to construct a balanced corpus, the numbers of unique words of both classes have to be equal. That is why our method extracts the top-N important words rather than the words above a threshold. The default value of N is the number of unique malicious words; N is a parameter for adjusting the number of unique words. Our method extracts important words based on the TFIDF scores and removes ephemeral words. Thus, our method mingles both sets of words and obtains a balanced corpus, as sketched below. Then, our method trains a Doc2vec model with the balanced corpus and converts both sets of words into feature vectors with the trained model (4). Finally, our method trains classifiers with the feature vectors and labels (5). The classifiers are Support Vector Machine (SVM), Random Forests (RF) and Multi-Layer Perceptron (MLP).

These are supervised learning models with associated learning algorithms that analyze data for classification and regression. Given a set of training examples, each labeled as belonging to one of two categories, the training algorithms build a model that assigns new examples to one category or the other.
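Returning to steps (2ab) and (3ab) above, the following sketch scores each class's words with the TFIDF formula of Sect. 2.2 and keeps the top-N to build a balanced corpus. How per-document scores are aggregated into one score per word (here, the maximum) is our assumption, and the toy paragraphs are hypothetical.

```python
# Sketch of steps (2ab)-(3ab): score each class's words with the TFIDF formula
# of Sect. 2.2 and keep the top-N. Aggregating per-document scores by taking
# the maximum per word is our assumption.
import math
from collections import Counter, defaultdict

def top_n_words(documents, n):
    D = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    score = defaultdict(float)
    for doc in documents:
        for term, freq in Counter(doc).items():
            weight = freq * math.log2(D / doc_freq[term])
            score[term] = max(score[term], weight)
    # The n highest-scoring words; ties are broken arbitrarily.
    return {w for w, _ in sorted(score.items(), key=lambda kv: -kv[1])[:n]}

# Toy paragraphs; N defaults to the number of unique malicious words (Sect. 4.2).
malicious_docs = [["post", "gate", "php", "payload", "bin"],
                  ["post", "gate", "php", "counter", "bin"]]
benign_docs = [["get", "example", "com", "index", "html"],
               ["get", "example", "com", "login", "php"]]
N = len({w for d in malicious_docs for w in d})
malicious_vocab = top_n_words(malicious_docs, N)
benign_vocab = top_n_words(benign_docs, N)

# The balanced corpus keeps only the selected words from both classes.
balanced = [[w for w in d if w in malicious_vocab | benign_vocab]
            for d in benign_docs + malicious_docs]
```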

Subsequently, our method derives paragraphs from unknown proxy logs in the same way and keeps only the top-N important words that were selected in the training process. The modified paragraphs are converted into feature vectors with the trained Doc2vec model. Then, the trained classifiers predict the label from the feature vectors. The predicted label is either malicious or benign.
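Continuing the sketches above (reusing the hypothetical `d2v`, `clf`, `benign_vocab` and `malicious_vocab`), the prediction step could look roughly as follows.

```python
# Prediction sketch: an unknown paragraph keeps only the words selected during
# training, then reuses the trained Doc2vec model and classifier from the
# sketches above (d2v, clf, benign_vocab, malicious_vocab are hypothetical).
selected = benign_vocab | malicious_vocab

def predict(paragraph_words):
    filtered = [w for w in paragraph_words if w in selected]
    return clf.predict([d2v.infer_vector(filtered)])[0]  # 0 = benign, 1 = malicious

print(predict(["get", "example", "com", "gate", "php"]))
```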

Table 1. Details of the MTA dataset.

4.3 Implementation

We implemented our method with gensim-1.2.1, scikit-learn-0.19.1 and chainer-2.0.12. Gensim is a Python library for unsupervised semantic modelling from plain text, and provides a Doc2vec implementation. Scikit-learn is a machine-learning library for Python that provides data-mining tools, including SVM and RF. Chainer is a flexible Python framework for neural networks, with which we build the MLP. We use the same models and set the same parameters as the previous method [1] for comparison.
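A sketch of the classifier stack is shown below; the implementation builds the MLP with chainer, but scikit-learn's MLPClassifier is substituted here for brevity, and the hyperparameter values are illustrative rather than those of [1].

```python
# Classifier stack sketch. The implementation builds the MLP with chainer;
# scikit-learn's MLPClassifier is substituted here for brevity, and the
# hyperparameter values are illustrative, not the settings of [1].
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

classifiers = {
    "SVM": SVC(kernel="rbf", C=1.0),
    "RF": RandomForestClassifier(n_estimators=100),
    "MLP": MLPClassifier(hidden_layer_sizes=(100,), max_iter=500),
}

# Each classifier is trained on the same Doc2vec feature vectors and labels:
# for name, clf in classifiers.items():
#     clf.fit(features, labels)
```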

5 Experiment

5.1 Dataset

In this experiment, we use pcap files captured from Exploit Kits (EKs) between 2014 and 2017 as malicious traffic. An EK is a software kit designed to run on web servers with the purpose of identifying vulnerabilities in client computers; it seeks and exploits those vulnerabilities to upload and execute malicious code on the client computer. We chose EKs that communicate via a standard protocol and attempt to imitate benign HTTP communication, and downloaded the packet traces from the website Malware-Traffic-Analysis.net. Table 1 shows the details.

This dataset (MTA) contains pcap files extracted from the traffic of recent EKs. Since our method aims to detect malicious traffic from proxy logs, we converted these pcap files into pseudo proxy logs: we extracted the HTTP traffic from the pcap files and concatenated the requests and responses.

This paper uses actual proxy logs collected between 2016 and 2017 as benign traffic. These proxy logs were collected at a campus network, which belongs to a class B network and consists of more than 5,000 computers. Due to computing resource constraints, we extracted 1 GB of logs from each of 2016 and 2017. We assume that these proxy logs represent benign traffic.

We mingle the malicious logs and benign logs into datasets. We split the datasets into training data and test data for timeline analysis. This paper uses three metrics: Precision (P), Recall (R) and F-measure (F).
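For reference, with malicious traffic treated as the positive class and TP, FP and FN denoting true positives, false positives and false negatives, these metrics are defined in the standard way:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F = \frac{2 \cdot P \cdot R}{P + R}$$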

5.2 Demonstrating the Imbalance

To compare the previous method [1] with the proposed method, we conducted a timeline analysis. We used the 2014 MTA data and the 2016 proxy logs as the training data, and the 2015 MTA data and the 2017 proxy logs as the test data. Table 2 shows the numbers of unique words and vectors. The previous method does not extract important words: the number of original unique words in the benign traffic is 608,805, and the previous method constructed a Doc2vec model from these original words. In contrast, the proposed method extracted 8,200 important words from the benign traffic and constructed a Doc2vec model from the reduced vocabulary. The numbers of vectors are the same for both methods, and the numbers of malicious and benign vectors differ greatly, which makes the experimental environment fairer and more practical.

Table 2. The numbers of unique words and vectors.

Next, Table 3 shows the performance of the timeline analysis. In spite of the reduced vocabulary, the performance on benign traffic is almost perfect, so we focus on the malicious traffic. The proposed method largely maintains its performance in the timeline analysis; the best F-measure reaches 0.90. With the previous method, the decline is too large to detect malicious traffic, and the performance is quite poor. This is because the dominant benign traffic drowns out the malicious feature representation; recall that the number of unique words in the benign traffic was far larger than that in the malicious traffic. Thus, the previous method does not perform accurately in a practical environment, whereas the proposed method can detect unseen malicious traffic in actual proxy logs.

Table 3. Performance of the timeline analysis.

5.3 Measuring Long-Term Performance

To measure the long-term performance of the proposed method, we conducted a timeline analysis with all the datasets. In this experiment, we used the 2014 to 2016 MTA data as training data and the 2015 to 2017 MTA data as test data, respectively. In every case, we used the 2016 proxy logs as training data and the 2017 proxy logs as test data. Moreover, we optimized the numbers of unique words (N) for both classes. Table 4 shows the performance of the long-term analysis.

Table 4. Performance of the long-term analysis.

For the same reason, we focus on the performance on the malicious traffic. When the 2014 data were used for training and the later years for testing, the best F-measure reached 0.94; performance gradually decreases every year, and three years later the F-measure still remains at 0.73. When the 2015 data were used for training and the following years for testing, the best F-measure reached 0.86, and the next year the F-measure remained at 0.67. When the 2016 data were used for training and the 2017 data for testing, the best F-measure reached 0.83.

6 Discussion

As a result, our method using proxy logs could detect the latest EKs almost precisely in a practical environment, where the previous method [1] showed quite poor performance. This is because dominant benign traffic drowns out the malicious feature representation. Our method could extract important words from the benign traffic and construct a balanced corpus, which is why it achieved good accuracy in the practical environment. Furthermore, our method optimized the numbers of unique words for both classes, extracting important words from both benign and malicious traffic; this adjustment improved classification accuracy slightly. Therefore, extracting important words from proxy logs with TFIDF scores is efficient and can mitigate the imbalance.

Our method requires several tens of minutes to construct a Doc2vec model and train a classifier in a practical environment; in this experiment, it took roughly from 30 min to an hour. The larger the pcap files or proxy logs used to construct the models, the longer the required time. However, the models can be constructed in advance, so our method performs within a practical time. With these pre-trained models, our method can classify unknown logs in a few seconds. Thus, our method can analyze network traffic or proxy logs in real time.

In this paper, we used malicious pcap files and benign proxy logs, which can be obtained from websites such as MTA and from our own network system. These files might contain privacy-sensitive information such as personal information, email addresses or clients' IP addresses. However, to detect unseen malicious traffic, our method requires only the pre-trained models, which include neither payloads nor logs. Furthermore, our method does not require clients' IP addresses and does not even need to distinguish the clients' sources. This means our method accepts most vantage points for monitoring traffic.

7 Related Work

Invernizzi et al. [5] built a network graph with IP addresses, domain names, FQDNs, URLs, paths and file names. Their method focuses on the correlation among nodes to detect malware distribution, and all the parameters are obtained from proxy logs. This method, however, has to cover many ranges of IP addresses, so it performs well only in large-scale networks such as ISPs; in addition, it requires the types of the downloaded files. Nelms et al. [6] proposed a trace-back system that can go back to the source from the URL transfer graph. This method uses hop counts, domain age and common features of the domain names to detect malicious URLs. Bartos et al. [7] categorized proxy logs into flows, extracted various features from the flows, and proposed how to learn feature vectors to classify malicious URLs. This method can decide the optimum feature vectors automatically; however, it demands devising basic features for learning. Mimura et al. [8] categorized proxy logs by FQDN to extract feature vectors, and proposed a RAT (Remote Access Trojan or Remote Administration Tool) detection method using machine learning techniques. This method uses the characteristic that RATs continue to access the same path regularly; however, it works only for C&C traffic. Shibahara et al. [9] focused on sequences of URLs that include artifacts of malicious redirections, and built a detection system that uses Convolutional Neural Networks. This method uses a honey client to collect URL sequences and their labels, and works for DbD attacks. Our method uses only proxy logs, requires no additional information, and does not require devising feature vectors at all. In addition, our method performs at any scale and can detect not only DbD attacks but also C&C traffic.

8 Conclusion

In this paper, we focused on the imbalance in which benign traffic is dominant and drowns out the malicious feature representation. This paper demonstrated that the previous method [1] is not effective on actual proxy logs because of this imbalance. To mitigate the imbalance, this paper proposed a way to generate a balanced corpus from actual proxy logs: our method extracts important words from proxy logs based on TFIDF scores. We conducted a timeline analysis with actual proxy logs, and the experimental results showed that our method can detect unseen malicious traffic in a practical environment. The best F-measure achieved 0.94 in the timeline analysis. Our method does not rely on specific attack techniques and does not demand devising feature vectors. Furthermore, our method is adaptable, fast and has few constraints in practical use.

In this paper, we used datasets generated by mixing malicious pcap files and benign logs. Applying our method to other proxy logs is future work.