Learning Representations for Log Data in Cybersecurity

Ignacio Arnaldo¹⁵,
Alfredo Cuesta-Infante¹⁶,
Ankit Arun¹⁵,
Mei Lam¹⁵,
Costas Bassias¹⁵ &
…
Kalyan Veeramachaneni¹⁷

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 10332)

Included in the following conference series:

International Conference on Cyber Security Cryptography and Machine Learning

Abstract

We introduce a framework for exploring and learning representations of log data generated by enterprise-grade security devices, with the goal of detecting advanced persistent threats (APTs) that span several weeks. The framework uses a divide-and-conquer strategy that combines behavioral analytics, time series modeling, and representation learning algorithms to model large volumes of data. In addition, given that we have access to human-engineered features, we analyze how well a series of representation learning algorithms complement those features in a variety of classification approaches. We demonstrate the approach with a novel dataset extracted from 3 billion log lines generated at the boundaries of an enterprise network, with reported command and control communications. The results validate our approach, achieving an area under the ROC curve of 0.943 and 95 true positives among the top 100 ranked instances on the test data set.

Notes

1. Depending on an organization’s size and level of activity, devices such as next-generation firewalls can generate up to 1 TB of log data and involve tens of millions of entities on a daily basis.
2. In enterprises today, logs are generated by network devices, endpoints, and user authentication servers, as well as by a myriad of applications. Each device registers a certain kind of activity and outputs different information. Note that even devices belonging to the same category (e.g., network devices such as firewalls) report different information and use a different format depending on the vendor and version.
3. For numeric fields, aggregations include minimum, maximum, average, and standard deviation; for categorical values, common aggregations are count_distinct and mode.
4. For example, if we consider a dataset spanning 10 days with $n=1000$ entity instances, a time step $t=1$ day, and $p=20$ aggregations, the result of this step would be a $1000 \times 10 \times 20$ array.
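
The aggregation step described in notes 3 and 4 can be illustrated with a minimal sketch. This is not the authors' implementation: the column names (`entity`, `timestamp`, `bytes`, `port`), the helper names `first_mode` and `build_feature_array`, and the choice to fill empty days and undefined standard deviations with zero are all assumptions made for illustration. It uses pandas and NumPy to turn raw log lines into an entities × time steps × aggregations array.

```python
import numpy as np
import pandas as pd


def first_mode(values: pd.Series):
    """Most frequent value; ties are broken by taking the smallest one."""
    return values.mode().iloc[0]


def build_feature_array(logs: pd.DataFrame, start: pd.Timestamp, n_days: int) -> np.ndarray:
    """Aggregate log lines per (entity, day) into an n x n_days x p array.

    Hypothetical schema: 'bytes' is a numeric field and 'port' a
    numerically coded categorical field.
    """
    # Integer day offset of each log line relative to the start date.
    logs = logs.assign(day=(logs["timestamp"] - start).dt.days)
    grouped = logs.groupby(["entity", "day"])
    features = pd.concat(
        [
            # Numeric field: minimum, maximum, average, standard deviation.
            grouped["bytes"].agg(["min", "max", "mean", "std"]),
            # Categorical field: distinct count and mode.
            grouped["port"].agg(["nunique", first_mode]),
        ],
        axis=1,
    )
    # Build the full (entity, day) grid so that inactive days appear as rows.
    entities = sorted(logs["entity"].unique())
    grid = pd.MultiIndex.from_product([entities, range(n_days)], names=["entity", "day"])
    # Assumption: inactive days and the undefined std of a single
    # observation are both encoded as 0.
    features = features.reindex(grid).fillna(0.0)
    return features.to_numpy(dtype=float).reshape(len(entities), n_days, -1)


# Toy data: two entities observed over three days.
logs = pd.DataFrame(
    {
        "entity": ["10.0.0.1", "10.0.0.1", "10.0.0.1", "10.0.0.2"],
        "timestamp": pd.to_datetime(
            ["2017-01-01 08:00", "2017-01-01 17:30", "2017-01-03 09:15", "2017-01-02 12:00"]
        ),
        "bytes": [100.0, 300.0, 50.0, 700.0],
        "port": [443, 443, 80, 53],
    }
)
features = build_feature_array(logs, start=pd.Timestamp("2017-01-01"), n_days=3)
print(features.shape)  # (2, 3, 6): 2 entities, 3 daily time steps, 6 aggregations
```

With the figures from note 4 (1000 entities, 10 daily time steps, 20 aggregations), the same procedure would yield a $1000 \times 10 \times 20$ array. Zero-filling is only one possible convention; a mask or a sentinel value may be preferable when zero is a meaningful observation.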

Author information

Authors and Affiliations

PatternEx Inc, San Jose, CA, USA
Ignacio Arnaldo, Ankit Arun, Mei Lam & Costas Bassias
Universidad Rey Juan Carlos, Madrid, Spain
Alfredo Cuesta-Infante
MIT, Cambridge, MA, USA
Kalyan Veeramachaneni

Corresponding author

Correspondence to Ignacio Arnaldo.

Editor information

Editors and Affiliations

Ben-Gurion University of the Negev, Beer-Sheva, Israel
Shlomi Dolev
Tata Consultancy Services (India), Chennai, Tamil Nadu, India
Sachin Lodha

About this paper

Cite this paper

Arnaldo, I., Cuesta-Infante, A., Arun, A., Lam, M., Bassias, C., Veeramachaneni, K. (2017). Learning Representations for Log Data in Cybersecurity. In: Dolev, S., Lodha, S. (eds) Cyber Security Cryptography and Machine Learning. CSCML 2017. Lecture Notes in Computer Science, vol 10332. Springer, Cham. https://doi.org/10.1007/978-3-319-60080-2_19

DOI: https://doi.org/10.1007/978-3-319-60080-2_19
Published: 02 June 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-60079-6
Online ISBN: 978-3-319-60080-2
eBook Packages: Computer Science, Computer Science (R0)