Abstract
Portable Document Format (PDF) documents are often used as carriers of malicious code that launches attacks or steals personal information. Traditional manual and supervised-learning-based detection methods rely heavily on labeled samples of malicious documents, which is problematic because very few labeled malicious samples are available in real-world scenarios.
This chapter presents a semi-supervised machine learning method for detecting malicious PDF documents. The method extracts structural features as well as statistical features that are derived from entropy sequences using the wavelet energy spectrum. A random sub-sampling strategy is employed to train multiple sub-classifiers. Because each sub-classifier is trained independently, the generalization capability during detection is enhanced. The semi-supervised learning method enables labeled as well as unlabeled samples to be used to classify malicious and benign PDF documents. Experimental results demonstrate that the method yields an accuracy of 94% despite using training data with just 11% labeled malicious samples.
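The entropy-based statistical features mentioned above can be illustrated with a minimal sketch. It is not the chapter's implementation: the abstract does not state the window size, the wavelet family or the number of decomposition levels, so the 256-byte window, the Haar wavelet, the three levels and all function names below are assumptions chosen for illustration.

```python
import math


def shannon_entropy(block):
    """Shannon entropy of a byte block in bits per byte (0.0 to 8.0)."""
    if not block:
        return 0.0
    counts = [0] * 256
    for b in block:
        counts[b] += 1
    n = len(block)
    return -sum(c / n * math.log2(c / n) for c in counts if c)


def entropy_sequence(data, window=256):
    """Entropy of consecutive, non-overlapping windows of the file bytes.

    The window size is an assumed value, not one taken from the chapter.
    """
    return [shannon_entropy(data[i:i + window])
            for i in range(0, len(data), window)]


def haar_energy_spectrum(signal, levels=3):
    """Energy of the Haar detail coefficients at each decomposition level.

    The Haar wavelet is an assumption; any discrete wavelet could be used.
    """
    approx = list(signal)
    scale = math.sqrt(2.0)
    energies = []
    for _ in range(levels):
        if len(approx) < 2:
            # Signal too short to decompose further.
            energies.append(0.0)
            continue
        if len(approx) % 2:
            # Pad odd-length signals by repeating the last value.
            approx.append(approx[-1])
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        detail = [(a - b) / scale for a, b in pairs]
        approx = [(a + b) / scale for a, b in pairs]
        energies.append(sum(d * d for d in detail))
    return energies


# Synthetic stand-in for a document: a constant region followed by a
# region in which every byte value occurs equally often.
sample = bytes(512) + bytes(range(256)) * 2
features = haar_energy_spectrum(entropy_sequence(sample))
print(features)
```

The resulting per-level energies form a fixed-length vector regardless of file size, which is what makes this kind of feature convenient to combine with structural features before training the sub-classifiers.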
Copyright information
© 2021 IFIP International Federation for Information Processing
About this paper
Cite this paper
Jiang, J. et al. (2021). Detecting Malicious PDF Documents Using Semi-Supervised Machine Learning. In: Peterson, G., Shenoi, S. (eds) Advances in Digital Forensics XVII. DigitalForensics 2021. IFIP Advances in Information and Communication Technology, vol 612. Springer, Cham. https://doi.org/10.1007/978-3-030-88381-2_7
DOI: https://doi.org/10.1007/978-3-030-88381-2_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-88380-5
Online ISBN: 978-3-030-88381-2
eBook Packages: Computer Science, Computer Science (R0)