Abstract
We introduce a framework for exploring and learning representations of log data generated by enterprise-grade security devices, with the goal of detecting advanced persistent threats (APTs) that span several weeks. The framework uses a divide-and-conquer strategy that combines behavioral analytics, time series modeling, and representation learning algorithms to model large volumes of data. In addition, because we have access to human-engineered features, we analyze how well a series of representation learning algorithms complements those features across a variety of classification approaches. We demonstrate the approach on a novel dataset extracted from 3 billion log lines generated at an enterprise network boundary with reported command and control communications. The results validate our approach, achieving an area under the ROC curve of 0.943 and 95 true positives among the top 100 ranked instances on the test dataset.
Notes
1. Depending on an organization’s size and level of activity, devices such as next-generation firewalls can generate up to 1 TB of log data and involve tens of millions of entities on a daily basis.
2. In enterprises today, logs are generated by network devices, endpoints, and user authentication servers, as well as by a myriad of applications. Each device registers a certain kind of activity and outputs different information. Note that even devices belonging to the same category (e.g., network devices such as firewalls) report different information and use a different format depending on the vendor and version.
3. For numeric fields, aggregations include minimum, maximum, average, and standard deviation; for categorical values, common aggregations are count_distinct and mode.
4. For example, if we consider a dataset spanning 10 days with \(n=1000\) entity instances, a time step \(t=1\) day, and \(p=20\) aggregations, the result of this step would be a \(1000 \times 10\times 20\) array.
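The aggregation and array-building steps described in notes 3 and 4 can be illustrated with a minimal sketch. It is not the authors' implementation: the record layout, the field names, the toy values, the helper names `aggregate` and `build_tensor`, and the choice to zero-fill time steps with no activity are all assumptions made for illustration. It uses only the Python standard library and \(p=6\) aggregations rather than 20.

```python
from collections import defaultdict
from statistics import mean, mode, pstdev

# Hypothetical log records: (entity, day index, bytes sent, destination port).
# Field names and values are invented for illustration only.
LOGS = [
    ("10.0.0.1", 0, 500, 443),
    ("10.0.0.1", 0, 700, 443),
    ("10.0.0.1", 1, 100, 80),
    ("10.0.0.2", 0, 50, 53),
    ("10.0.0.2", 1, 60, 53),
    ("10.0.0.2", 1, 40, 8080),
    ("10.0.0.2", 1, 20, 53),
]


def aggregate(rows):
    """Return p = 6 aggregations for one (entity, time step) cell."""
    sizes = [r[2] for r in rows]  # numeric field
    ports = [r[3] for r in rows]  # categorical field
    return [
        min(sizes),
        max(sizes),
        mean(sizes),
        pstdev(sizes),    # population standard deviation
        len(set(ports)),  # count_distinct
        mode(ports),      # most frequent value
    ]


def build_tensor(logs, n_steps, n_aggs=6):
    """Build a nested list of shape (n entities, n_steps, n_aggs)."""
    cells = defaultdict(list)
    for row in logs:
        cells[(row[0], row[1])].append(row)
    entities = sorted({row[0] for row in logs})
    tensor = [
        [
            # Assumption: a time step with no log lines is filled with zeros.
            aggregate(cells[(e, t)]) if (e, t) in cells else [0.0] * n_aggs
            for t in range(n_steps)
        ]
        for e in entities
    ]
    return entities, tensor


entities, tensor = build_tensor(LOGS, n_steps=2)
shape = (len(tensor), len(tensor[0]), len(tensor[0][0]))
print(shape)  # (2, 2, 6)
```

With \(n=1000\) entities, 10 daily time steps, and \(p=20\) aggregations, the same procedure would yield the \(1000 \times 10 \times 20\) array mentioned in note 4.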
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Arnaldo, I., Cuesta-Infante, A., Arun, A., Lam, M., Bassias, C., Veeramachaneni, K. (2017). Learning Representations for Log Data in Cybersecurity. In: Dolev, S., Lodha, S. (eds) Cyber Security Cryptography and Machine Learning. CSCML 2017. Lecture Notes in Computer Science, vol 10332. Springer, Cham. https://doi.org/10.1007/978-3-319-60080-2_19
DOI: https://doi.org/10.1007/978-3-319-60080-2_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-60079-6
Online ISBN: 978-3-319-60080-2
eBook Packages: Computer Science (R0)