DETECTING MALICIOUS PDF DOCUMENTS USING SEMI-SUPERVISED MACHINE LEARNING

Jianguo Jiang, Nan Song, Min Yu, Kam-Pui Chow, Gang Li, Chao Liu & Weiqing Huang

Part of the book series: IFIP Advances in Information and Communication Technology (IFIPAICT, volume 612)

Included in the following conference series:

IFIP International Conference on Digital Forensics

Abstract

Portable Document Format (PDF) documents are often used as carriers of malicious code that launches attacks or steals personal information. Traditional manual and supervised-learning-based detection methods rely heavily on labeled samples of malicious documents, which is problematic because very few labeled malicious samples are available in real-world scenarios.

This chapter presents a semi-supervised machine learning method for detecting malicious PDF documents. It extracts structural features as well as statistical features based on entropy sequences using the wavelet energy spectrum. A random sub-sampling strategy is employed to train multiple sub-classifiers. Each sub-classifier is independent, which enhances the generalization capability during detection. The semi-supervised learning method uses both labeled and unlabeled samples to classify malicious and benign PDF documents. Experimental results demonstrate that the method yields an accuracy of 94% despite using training data with just 11% labeled malicious samples.
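
The statistical features mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the window size, the wavelet family, or the layout of the feature vector, so everything below is an assumption: the file is cut into non-overlapping 256-byte windows, the Shannon entropy of each window forms the entropy sequence, and a Haar wavelet decomposition gives one energy value per level. The function names `entropy_sequence` and `haar_energy_spectrum` are hypothetical and are not taken from the chapter.

```python
import math
from collections import Counter
from typing import List


def entropy_sequence(data: bytes, window: int = 256) -> List[float]:
    """Shannon entropy (bits per byte) of each non-overlapping window.

    The 256-byte window is an assumed value, not one from the chapter.
    """
    seq = []
    for start in range(0, len(data), window):
        chunk = data[start:start + window]
        n = len(chunk)
        counts = Counter(chunk)
        # H = sum over byte values of p * log2(1 / p), with p = count / n
        seq.append(sum(c / n * math.log2(n / c) for c in counts.values()))
    return seq


def haar_energy_spectrum(signal: List[float]) -> List[float]:
    """Energy of the Haar detail coefficients at each level, finest first.

    The Haar wavelet is an assumed choice. Odd-length inputs are padded
    by repeating the last value before each decomposition step.
    """
    approx = list(signal)
    energies = []
    root2 = math.sqrt(2.0)
    while len(approx) >= 2:
        if len(approx) % 2:
            approx.append(approx[-1])
        pairs = list(zip(approx[0::2], approx[1::2]))
        detail = [(a - b) / root2 for a, b in pairs]
        approx = [(a + b) / root2 for a, b in pairs]
        energies.append(sum(d * d for d in detail))
    return energies


if __name__ == "__main__":
    # Synthetic input: a constant region followed by a high-entropy region.
    sample = bytes(1024) + bytes(range(256)) * 4
    features = haar_energy_spectrum(entropy_sequence(sample))
    print(features)
```

Under these assumptions, a sharp jump in entropy between neighboring windows (for example, plain text followed by an encoded or compressed stream) shows up as energy in the detail coefficients at some level. The resulting per-level energies form a short vector that could be concatenated with structural features before classification.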

Author information

Authors and Affiliations

Cyber Security at the Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Jianguo Jiang, Nan Song, Min Yu, Chao Liu & Weiqing Huang
Computer Science at the University of Hong Kong, Hong Kong, China
Kam-Pui Chow
Information Technology at Deakin University, Burwood, Australia
Gang Li

Corresponding author

Correspondence to Kam-Pui Chow.

Editor information

Editors and Affiliations

Department of Electrical and Computer Engineering, Air Force Institute of Technology, Wright-Patterson AFB, OH, USA
Gilbert Peterson
Tandy School of Computer Science, University of Tulsa, Tulsa, OK, USA
Sujeet Shenoi

About this paper

Cite this paper

Jiang, J. et al. (2021). DETECTING MALICIOUS PDF DOCUMENTS USING SEMI-SUPERVISED MACHINE LEARNING. In: Peterson, G., Shenoi, S. (eds) Advances in Digital Forensics XVII. DigitalForensics 2021. IFIP Advances in Information and Communication Technology, vol 612. Springer, Cham. https://doi.org/10.1007/978-3-030-88381-2_7

DOI: https://doi.org/10.1007/978-3-030-88381-2_7
Published: 15 October 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-88380-5
Online ISBN: 978-3-030-88381-2
eBook Packages: Computer Science, Computer Science (R0)