Learning Representations for Log Data in Cybersecurity

  • Conference paper
Cyber Security Cryptography and Machine Learning (CSCML 2017)

Abstract

We introduce a framework for exploring and learning representations of log data generated by enterprise-grade security devices, with the goal of detecting advanced persistent threats (APTs) spanning several weeks. The framework uses a divide-and-conquer strategy that combines behavioral analytics, time series modeling, and representation learning algorithms to model large volumes of data. In addition, given that we have access to human-engineered features, we analyze the capability of a series of representation learning algorithms to complement human-engineered features in a variety of classification approaches. We demonstrate the approach with a novel dataset extracted from 3 billion log lines generated at an enterprise network boundary with reported command and control communications. The presented results validate our approach, achieving an area under the ROC curve of 0.943 and 95 true positives out of the top 100 ranked instances on the test data set.
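
The two evaluation measures named in the abstract, area under the ROC curve and true positives among the top-ranked instances, can be illustrated with a minimal sketch. The labels and scores below are made-up toy values, not the paper's data, and the function names `roc_auc` and `true_positives_at_k` are hypothetical helpers introduced only for this illustration.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def true_positives_at_k(labels, scores, k):
    """Count malicious instances among the k highest-scored instances."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(y for _, y in ranked[:k])


# Toy data (assumption): 1 = malicious, 0 = benign; higher score = more suspicious.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.35, 0.3, 0.2, 0.1]

auc = roc_auc(labels, scores)
tp_at_3 = true_positives_at_k(labels, scores, 3)
print(f"AUC = {auc:.3f}, true positives in top 3 = {tp_at_3}")
```

The pairwise form is quadratic in the number of instances and is meant only to make the definition concrete; for data sets of realistic size a rank-based implementation would be the usual choice.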

Notes

  1. Depending on an organization’s size and level of activity, devices such as next-generation firewalls can generate up to 1 TB of log data and involve tens of millions of entities on a daily basis.

  2. In enterprises today, logs are generated by network devices, endpoints, and user authentication servers, as well as by a myriad of applications. Each device registers a certain kind of activity and outputs different information. Note that even devices belonging to the same category (e.g., network devices such as firewalls) report different information and use different formats depending on the vendor and version.

  3. For numeric fields, aggregations include minimum, maximum, average, and standard deviation; for categorical values, common aggregations are count_distinct and mode.

  4. For example, if we consider a dataset spanning 10 days with \(n=1000\) entity instances, a time step of \(t=1\) day, and \(p=20\) aggregations, the result of this step would be a \(1000 \times 10 \times 20\) array.
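
The aggregation step described in notes 3 and 4 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the record layout (entity, day index, one numeric field, one categorical field), the sample values, the zero-filling of inactive days, and the helper names `aggregate` and `build_array` are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, mode, pstdev

# Hypothetical log records: (entity, day_index, bytes_sent, destination_port).
records = [
    ("10.0.0.1", 0, 120, 443),
    ("10.0.0.1", 0, 300, 443),
    ("10.0.0.1", 0, 180, 80),
    ("10.0.0.1", 1, 50, 53),
    ("10.0.0.2", 0, 900, 22),
    ("10.0.0.2", 1, 700, 22),
    ("10.0.0.2", 1, 100, 22),
]

N_STEPS = 2  # number of time steps (days) in this toy data set
FEATURES = ["min", "max", "mean", "std", "count_distinct", "mode"]


def aggregate(rows):
    """Aggregate one entity's records for one time step into p features."""
    if not rows:
        return [0] * len(FEATURES)  # assumption: no activity maps to zeros
    nums = [num for num, _ in rows]  # numeric field
    cats = [cat for _, cat in rows]  # categorical field
    return [
        min(nums),
        max(nums),
        mean(nums),
        pstdev(nums),    # population standard deviation
        len(set(cats)),  # count_distinct
        mode(cats),      # most frequent category
    ]


def build_array(records, n_steps):
    """Return (entities, array) with array[i][t] holding the p features."""
    groups = defaultdict(list)
    for entity, day, num, cat in records:
        groups[(entity, day)].append((num, cat))
    entities = sorted({record[0] for record in records})
    array = [
        [aggregate(groups.get((entity, t), [])) for t in range(n_steps)]
        for entity in entities
    ]
    return entities, array


entities, array = build_array(records, N_STEPS)
shape = (len(array), len(array[0]), len(array[0][0]))
print(shape)  # (entities, time steps, aggregations)
```

With 1000 entities, 10 daily time steps, and 20 aggregations, the same construction would yield the \(1000 \times 10 \times 20\) array mentioned in note 4. In practice a categorical mode would need to be encoded before being fed to a model; it is left as a raw value here for readability.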

Author information

Corresponding author

Correspondence to Ignacio Arnaldo.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Arnaldo, I., Cuesta-Infante, A., Arun, A., Lam, M., Bassias, C., Veeramachaneni, K. (2017). Learning Representations for Log Data in Cybersecurity. In: Dolev, S., Lodha, S. (eds) Cyber Security Cryptography and Machine Learning. CSCML 2017. Lecture Notes in Computer Science, vol 10332. Springer, Cham. https://doi.org/10.1007/978-3-319-60080-2_19

  • DOI: https://doi.org/10.1007/978-3-319-60080-2_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-60079-6

  • Online ISBN: 978-3-319-60080-2

  • eBook Packages: Computer Science, Computer Science (R0)
