DETECTING MALICIOUS PDF DOCUMENTS USING SEMI-SUPERVISED MACHINE LEARNING

  • Conference paper
Advances in Digital Forensics XVII (DigitalForensics 2021)

Part of the book series: IFIP Advances in Information and Communication Technology (IFIPAICT, volume 612)

Abstract

Portable Document Format (PDF) documents are often used as carriers of malicious code that launches attacks or steals personal information. Traditional manual and supervised-learning-based detection methods rely heavily on labeled samples of malicious documents, which is problematic because very few labeled malicious samples are available in real-world scenarios.

This chapter presents a semi-supervised machine learning method for detecting malicious PDF documents. It extracts structural features as well as statistical features based on entropy sequences using the wavelet energy spectrum. A random sub-sampling strategy is employed to train multiple sub-classifiers. Each classifier is independent, which enhances the generalization capability during detection. The semi-supervised learning method enables labeled as well as unlabeled samples to be used to classify malicious and benign PDF documents. Experimental results demonstrate that the method yields an accuracy of 94% despite using training data with just 11% labeled malicious samples.
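
The entropy-based statistical features mentioned above can be illustrated with a minimal sketch. It assumes non-overlapping 256-byte windows, a Haar wavelet, and three decomposition levels; none of these choices are stated in the abstract, and the function names are hypothetical rather than taken from the chapter.

```python
import math
from collections import Counter


def entropy_sequence(data: bytes, window: int = 256) -> list:
    """Shannon entropy (bits per byte) of consecutive non-overlapping windows.

    The window size is an assumption; trailing bytes that do not fill a
    whole window are ignored.
    """
    seq = []
    for start in range(0, len(data) - window + 1, window):
        counts = Counter(data[start:start + window])
        h = -sum((c / window) * math.log2(c / window) for c in counts.values())
        seq.append(h)
    return seq


def haar_step(signal: list) -> tuple:
    """One level of the orthonormal Haar transform: (approximation, detail).

    If the signal has odd length, the last sample is dropped.
    """
    s = 1.0 / math.sqrt(2.0)
    pairs = range(0, len(signal) - 1, 2)
    approx = [(signal[i] + signal[i + 1]) * s for i in pairs]
    detail = [(signal[i] - signal[i + 1]) * s for i in pairs]
    return approx, detail


def wavelet_energy_spectrum(signal: list, levels: int = 3) -> list:
    """Energy (sum of squared detail coefficients) per level, finest first."""
    energies = []
    approx = list(signal)
    for _ in range(levels):
        if len(approx) < 2:
            break  # nothing left to decompose
        approx, detail = haar_step(approx)
        energies.append(sum(d * d for d in detail))
    return energies


# Toy input: low-entropy and high-entropy blocks alternate, as they might
# when plain text objects are interleaved with compressed or encoded streams.
sample = (bytes(256) + bytes(range(256))) * 2
features = wavelet_energy_spectrum(entropy_sequence(sample))
```

For the toy input, the entropy sequence alternates between the minimum and maximum values, so nearly all of the energy lands in the finest level. The resulting per-level energies form a short fixed-length vector that could sit alongside structural features as classifier input.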

Author information

Corresponding author

Correspondence to Kam-Pui Chow.

Copyright information

© 2021 IFIP International Federation for Information Processing

About this paper

Cite this paper

Jiang, J. et al. (2021). DETECTING MALICIOUS PDF DOCUMENTS USING SEMI-SUPERVISED MACHINE LEARNING. In: Peterson, G., Shenoi, S. (eds) Advances in Digital Forensics XVII. DigitalForensics 2021. IFIP Advances in Information and Communication Technology, vol 612. Springer, Cham. https://doi.org/10.1007/978-3-030-88381-2_7

  • DOI: https://doi.org/10.1007/978-3-030-88381-2_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-88380-5

  • Online ISBN: 978-3-030-88381-2

  • eBook Packages: Computer Science, Computer Science (R0)
