Double-Layer Detection Model of Malicious PDF Documents Based on Entropy Method with Multiple Features
Figure 1. The basic structure of PDFs.
Figure 2. Model framework.
Figure 3. (a) Frequency of characters in malicious PDFs; (b) number of characters in each type.
Figure 4. Frame of EMBORAM module.
Figure 5. AdaBoost-optimized random forest classification model.
Figure 6. Robustness-optimized support vector machine classification model.
Figure 7. (a) Effect of threshold K on accuracy; (b) effect of threshold K on the number of feature sets.
Figure 8. (a) Effect of threshold B on accuracy; (b) effect of threshold B on the number of samples in the boundary set.
Figure 9. (a) Comparison of detection results in malicious samples; (b) comparison of detection results in malicious-high samples.
Figure 10. (a) Comparison of TP values; (b) comparison of FP values; (c) comparison of FN values.
Abstract
1. Introduction
- (1) We propose a multi-feature fusion extraction method. It extracts not only basic features such as PDF document objects, physical structure, and content stream, but also common dangerous features identified by frequency induction. This addresses the single-detection-target limitation of conventional detection methods and resists obfuscation and encryption.
- (2) We present a feature selection and filtering module, EMBORAM. It combines the feature weights calculated by RReliefF with the feature-label correlations calculated by MIC, and uses the entropy method to generate the best feature set, which can resist anti-detection methods such as data filling and imitation attacks.
- (3) We construct a double-layer processing framework from the AdaBoost-optimized random forest and the robustness-optimized support vector machine classification models. Optimizing and combining the two models mitigates their individual deficiencies in PDF document detection and improves the detection effect.
- (4) We collect a large dataset that gives the training samples comprehensive coverage. Experiments verify that the detection accuracy is high and the time consumption is low. The results show that this model can efficiently detect malicious PDFs that adopt common attack methods.
2. Background and Related Work
2.1. PDF Background
2.2. Static Detection Method
2.3. Dynamic Detection Method
3. Double-Layer Detection Model Based on Entropy Method with Multiple Features
3.1. Overview of the Model
3.2. Basic Feature Extraction Module
3.3. Dangerous Feature Extraction Module
3.4. EMBORAM Module
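As a rough illustration of the entropy method this module relies on, the sketch below weights two per-feature indicators (an RReliefF-style feature weight and a MIC-style feature-label correlation) with the textbook entropy weight method, then keeps the top-scoring share of features. The function names, the min-max scaling, the linear combination of the two indicators, and the `keep_ratio` parameter (standing in for the feature selection proportion threshold) are assumptions made for illustration, not the exact EMBORAM procedure. The RReliefF and MIC values are taken as precomputed inputs.

```python
import math


def entropy_weights(matrix):
    """Entropy-method weight of each indicator (column) of a score matrix.

    matrix[i][j] is the score of feature i under indicator j.
    Indicators whose values vary more across features get larger weights.
    """
    n, m = len(matrix), len(matrix[0])
    if n < 2:
        raise ValueError("need at least two features")
    divergences = []
    for j in range(m):
        column = [row[j] for row in matrix]
        low, high = min(column), max(column)
        if high == low:
            divergences.append(0.0)  # a constant indicator carries no information
            continue
        scaled = [(v - low) / (high - low) for v in column]  # min-max scaling
        total = sum(scaled)
        probs = [v / total for v in scaled]
        # Normalized Shannon entropy of the column, in [0, 1).
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        divergences.append(1.0 - entropy)
    norm = sum(divergences)
    if norm == 0:
        return [1.0 / m] * m  # every indicator is constant: fall back to equal weights
    return [d / norm for d in divergences]


def select_features(names, relief_weights, mic_scores, keep_ratio):
    """Rank features by an entropy-weighted score and keep the top share.

    The linear combination of the two scaled indicators is an assumption.
    """
    def scale(values):
        low, high = min(values), max(values)
        span = (high - low) or 1.0
        return [(v - low) / span for v in values]

    matrix = [list(pair) for pair in zip(relief_weights, mic_scores)]
    w_relief, w_mic = entropy_weights(matrix)
    scores = [
        w_relief * r + w_mic * c
        for r, c in zip(scale(relief_weights), scale(mic_scores))
    ]
    keep = max(1, math.ceil(keep_ratio * len(names)))
    ranked = sorted(zip(names, scores), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:keep]]
```

Calling `select_features(names, relief_weights, mic_scores, 0.5)` returns the better-scoring half of the feature names. An indicator that barely varies across features contributes little to the ranking, which is the usual motivation for entropy weighting.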
3.5. Optimization Training Module
3.5.1. AdaBoost-Optimized Random Forest Classification Model
- (1) We use the dataset and features to build the initial decision trees in the random forest.
- (2) The classification weight of each decision tree for each category is calculated from its classification ability. We first initialize the training sample weights, that is, the weight of the decision tree for each sample, as in the following equation.
  - (a) We calculate the error rate of the decision tree on the category, as in the following equation.
  - (b) If the error rate is not less than 1/2, the decision tree is dropped and this cycle ends, because AdaBoost is designed as a binary classification algorithm and requires the error rate to be less than 1/2 (the random-guessing probability).
  - (c) Otherwise, we calculate the weight of the decision tree for the category, as in the following equation.
  - (d) We update the weight of the decision tree in the random forest, as in the following equation.
- (3) We repeat step (2) in sequence, which yields the final voting weights of the decision trees for the two categories.
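The steps above can be sketched roughly as follows. The equations are not reproduced in this text, so the sketch falls back on the textbook AdaBoost formulas as an assumption: a weighted misclassification rate as the error, a tree weight of 0.5 * ln((1 - e) / e), and an exponential re-weighting of the samples. The function name `tree_voting_weights` and the choice to score each tree only on samples of the given category are illustrative, not the paper's exact definition. Tree predictions are taken as given, so the code does not build the forest itself.

```python
import math


def tree_voting_weights(tree_preds, labels, category, eps=1e-10):
    """AdaBoost-style voting weights of pre-built trees for one category.

    tree_preds: one list of predicted labels per decision tree
    labels:     true labels (0 = benign, 1 = malicious)
    category:   the class whose samples are used to score each tree
    Returns one weight per tree; dropped trees get 0.0.
    """
    idx = [i for i, y in enumerate(labels) if y == category]
    if not idx:
        raise ValueError("no samples of the requested category")
    # Step (2): uniform initial sample weights (assumed initialization).
    w = {i: 1.0 / len(idx) for i in idx}
    alphas = []
    for preds in tree_preds:
        # Step (a): weighted error rate of this tree on the category.
        err = sum(w[i] for i in idx if preds[i] != labels[i])
        # Step (b): drop trees that are no better than random guessing.
        if err >= 0.5:
            alphas.append(0.0)
            continue
        # Step (c): tree weight; eps guards against log of infinity when err == 0.
        err = max(err, eps)
        alpha = 0.5 * math.log((1.0 - err) / err)
        alphas.append(alpha)
        # Step (d): raise the weight of misclassified samples, then renormalize.
        for i in idx:
            w[i] *= math.exp(alpha if preds[i] != labels[i] else -alpha)
        total = sum(w.values())
        for i in idx:
            w[i] /= total
    return alphas
```

Running the function once with `category=0` and once with `category=1` gives each tree a separate voting weight per class, which corresponds to step (3).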
3.5.2. Robustness-Optimized Support Vector Machine Classification Model
3.5.3. Double-Layer Processing Framework
4. Experiment and Test
4.1. Dataset
4.2. Environment
4.3. Procedure
4.4. The Impact of Feature Selection Proportion Threshold
4.5. Filtered Features
4.6. The Impact of the Boundary Sample Selection Threshold
4.7. Evaluation Indicators
4.8. Contrast Model
4.9. Evaluation of Competency
4.10. Time Overhead
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A
Feature Type | Symbols |
---|---|
Objects | obj, endobj, /Type, /FirstChar, /LastChar, /Widths, /BaseFont, /Subtype, /Ascent, /CapHeight, /Descent, /ItalicAngle, /StemV, /MissingWidth, /CharSet, /Rotate, /Resources, /Contents |
Physical structure | %PDF, xref, startxref, trailer, /Size, /Prev, /Encrypt, /ID, %%EOF |
Logical structure | /Root, /Catalog, /Pages, /Kids, /Parent, /Count, /Outlines, /Prev, /Next, /First, /Last |
Content stream | stream, endstream, /Filter, /FlateDecode, /Length, /DecodeParms |
Metadata | /Info, /Author, /Producer, /Creator, /Keywords, /Title, /Subject, /ModDate, /CreationDate |
Appendix B
Basic Feature Type | Feature Content |
---|---|
Objects (46) | num_obj, num_endobj, num_type, num_diff_type, maxnum_same_type, num_firstchar, num_lastchar, num_width, num_BaseFont, num_Subtype, num_Ascent, num_CapHeight, num_Descent, num_ItalicAngle, num_StemV, num_MissingWidth, num_CharSet, num_Rotate, num_Resources, num_Contents, loc_obj, loc_endobj, loc_firstchar, loc_lastchar, loc_width, loc_BaseFont, loc_Subtype, loc_Ascent, loc_CapHeight, loc_CharSet, loc_Rotate, loc_Resources, loc_Contents, len_type, len_firstchar, len_lastchar, len_width, len_BaseFont, len_Subtype, len_Ascent, len_CapHeight, len_Descent, len_CharSet, len_Rotate, len_Resources, len_Contents |
Physical structure (19) | version_pdf, num_xref, num_startxref, num_trailer, num_Size, num_Prev, num_Encrypt, num_ID, loc_xref, loc_startxref, loc_trailer, loc_Size, loc_Prev, loc_Encrypt, loc_ID, loc_EOF, len_xref, num_xref_modified, num_xref_initial |
Logical structure (26) | num_Root, num_Catalog, num_Pages, num_Kids, num_Parent, num_Count, num_Outlines, num_Prev, num_Next, num_First, num_Last, loc_Root, loc_Catalog, loc_Pages, loc_Kids, loc_Parent, loc_Count, loc_Outlines, loc_Prev, loc_Next, loc_First, loc_Last, sum_Kids, sum_Parent, sum_Prev, sum_Next |
Content stream (16) | num_stream, num_endstream, num_Filter, num_FlateDecode, num_Length, num_DecodeParms, len_stream, diff_num_Filter, diff_num_FlateDecode, diff_num_DecodeParms, loc_stream, loc_endstream, loc_Filter, loc_FlateDecode, loc_Length, loc_DecodeParms |
Metadata feature (23) | len_Info, len_Author, len_Producer, len_Creator, len_Keywords, len_Title, len_Subject, int_ModDate, int_CreationDate, num_Info, num_Author, num_Producer, num_Creator, num_Keywords, num_Title, num_Subject, loc_Info, loc_Author, loc_Producer, loc_Creator, loc_Keywords, loc_Title, loc_Subject |
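To make the `num_` and `loc_` naming in the table above more concrete, here is a minimal sketch that derives such features from raw PDF bytes. It assumes that `num_X` is the number of occurrences of keyword X and that `loc_X` is the offset of its first occurrence divided by the file length. Both readings are guesses from the feature names, and the helper names `keyword_pattern` and `structural_features` are hypothetical. A real extractor would also need to decode compressed streams, which this sketch does not attempt.

```python
import re


def keyword_pattern(keyword):
    """Regex for a whole PDF keyword, so b"obj" does not match inside b"endobj"."""
    # Bare keywords must not be preceded by a letter; names such as /Type and
    # comment markers such as %%EOF already carry their own delimiter.
    prefix = b"" if keyword.startswith((b"/", b"%")) else rb"(?<![A-Za-z])"
    return re.compile(prefix + re.escape(keyword) + rb"(?![A-Za-z0-9])")


def structural_features(data, keywords):
    """Count and locate each keyword in the raw bytes of a PDF file.

    Returns a dict with num_<name> (occurrence count) and loc_<name>
    (relative offset of the first occurrence, or -1.0 if absent).
    """
    features = {}
    for keyword in keywords:
        name = keyword.decode("ascii").lstrip("/%")
        offsets = [m.start() for m in keyword_pattern(keyword).finditer(data)]
        features["num_" + name] = len(offsets)
        features["loc_" + name] = offsets[0] / len(data) if offsets else -1.0
    return features
```

A typical call reads the file in binary mode and passes a keyword list drawn from Appendix A, for example `structural_features(pdf_bytes, [b"obj", b"endobj", b"xref", b"/Filter", b"%%EOF"])`.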
Appendix C
Feature Type (Number) | Symbols |
---|---|
Basic attribute (26) | number of \x, +, space, character, string, getElementById, =, keyword, hex, int, encode, \, |, %, (), #, ., ’, [], {}, !, ;, length of shortest string, longest string, code line; proportion of keyword |
Redirection (6) | number of referrer, setTimeout, replace, url, reload, location |
Suspicious keyword (23) | number of exe, js, php, cmd, Wscript, string (length > 30), addEventListener, ActiveXObject, iframe, search, onbeforeunload, onbeforeload, setAttribute, fireEvent, dispatchEvent, onmouseover, onunload, onerror, classid, SystemRoot, attachEvent, createElement |
Confusion (27) | number of charCodeAt, function, var, charAt, write, concat, fromCharCode, escape, eval, substring, indexOf, parseInt, toString, decode, random, log, split, heapspray, proportion of space, %, |, \, ;, (), hex, int, encode |
References
- Yu, M.; Jiang, J.G.; Li, G.; Liu, C.; Huang, W.Q.; Song, N. A Survey of Research on Malicious Document Detection. J. Cyber Secur. 2021, 6, 54–76.
- Nissim, N.; Cohen, A.; Moskovitch, R.; Shabtai, A.; Edry, M.; Bar-Ad, O.; Elovici, Y. ALPD: Active Learning Framework for Enhancing the Detection of Malicious PDF Files. In Proceedings of the 2014 IEEE Joint Intelligence and Security Informatics Conference, The Hague, The Netherlands, 24–26 September 2014; pp. 91–98.
- Wang, Y. The De-Obfuscation Method in the Static Detection of Malicious PDF Documents. In Proceedings of the 7th Annual International Conference on Network and Information Systems for Computers, Guiyang, China, 23–25 July 2021; pp. 44–47.
- Lei, J.; Yi, P.; Chen, X.; Wang, L.; Mao, M. PDF Document Detection Model Based on System Calls and Data Provenance. J. Comput. Appl. 2022, 42, 3831.
- Lu, X.; Wang, F.; Jiang, C.; Lio, P. A Universal Malicious Documents Static Detection Framework Based on Feature Generalization. Appl. Sci. 2021, 11, 12134.
- Maiorca, D.; Giacinto, G.; Corona, I. A pattern recognition system for malicious PDF files detection. In International Conference on Machine Learning and Data Mining in Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2012; pp. 510–524.
- Cohen, A.; Nissim, N.; Rokach, L.; Elovici, Y. SFEM: Structural Feature Extraction Methodology for the Detection of Malicious Office Documents Using Machine Learning Methods. Expert Syst. Appl. 2016, 63, 324–343.
- Wang, L.N.; Tan, C.; Yu, R.W. The Malware Detection Based on Data Breach Actions. J. Comput. Res. Dev. 2017, 54, 1537–1548.
- Feng, D.; Yu, M.; Wang, Y. Detecting Malicious PDF Files Using Semi-Supervised Learning Method. In Proceedings of the International Conference on Advanced Computer Science Applications and Technologies, Beijing, China, 25–26 March 2017.
- Corona, I.; Maiorca, D.; Ariu, D.; Giacinto, G. Lux0R: Detection of malicious PDF-embedded JavaScript code through discriminant analysis of API references. In Proceedings of the Workshop on Artificial Intelligent and Security Workshop, Scottsdale, AZ, USA, 3–7 November 2014; pp. 47–57.
- Tzermias, Z.; Sykiotakis, G.; Polychronakis, M.; Markatos, E.P. Combining static and dynamic analysis for the detection of malicious documents. In Proceedings of the Fourth European Workshop on System Security; ACM: New York, NY, USA, 2011; pp. 1–6.
- Maiorca, D.; Ariu, D.; Corona, I.; Giacinto, G. An evasion resilient approach to the detection of malicious PDF files. In Proceedings of the 2015 International Conference on Information Systems Security and Privacy, Angers, France, 9–11 February 2015; Springer: Cham, Switzerland, 2015; pp. 68–85.
- Du, X.; Lin, Y.; Sun, Y. Malicious PDF document detection based on mixed feature. J. Commun. 2019, 40, 118–128.
- Vatamanu, C.; Gavriluţ, D.; Benchea, R. A practical approach on clustering malicious PDF documents. J. Comput. Virol. 2012, 8, 151–163.
- Maiorca, D.; Ariu, D.; Corona, I.; Giacinto, G. A Structural and Content-Based Approach for a Precise and Robust Detection of Malicious PDF Files. In Proceedings of the 1st International Conference on Information Systems Security and Privacy, Angers, France, 9–11 February 2015; pp. 27–36.
- Lu, X.; Zhuge, J.; Wang, R.; Cao, Y.; Chen, Y. De-Obfuscation and Detection of Malicious PDF Files with High Accuracy. In Proceedings of the 46th Hawaii International Conference on System Sciences, Wailea, HI, USA, 7–10 January 2013; pp. 4890–4899.
- ISO 32000-2:2020. Available online: https://www.pdfa.org/resource/iso-32000-pdf/ (accessed on 1 December 2022).
- Šrndić, N.; Laskov, P. Detection of malicious PDF files based on hierarchical document structure. In Proceedings of the Network and Distributed System Security Symposium, San Diego, CA, USA, 25–27 February 2013.
- Jose, T.S.; Santos, D.L. Malicious PDF Documents Detection using Machine Learning Techniques. In Proceedings of the 4th International Conference on Information Systems Security and Privacy, Madeira, Portugal, 22–24 January 2018.
- Laskov, P. Static detection of malicious JavaScript-bearing PDF documents. In Proceedings of the Twenty-Seventh Computer Security Applications Conference, Orlando, FL, USA, 5–9 December 2011; pp. 373–382.
- Šrndić, N.; Laskov, P. Practical Evasion of a Learning-Based Classifier: A Case Study. In Proceedings of the 2014 IEEE Symposium on Security and Privacy, San Jose, CA, USA, 17–18 May 2014; pp. 197–211.
- Chandran, P.P.; Jeyakarthic, M. Intelligent Optimal Gated Recurrent Unit based Malicious PDF Detection and Classification Model. In Proceedings of the 2022 International Conference on Applied Artificial Intelligence and Computing, Salem, India, 9–11 May 2022; pp. 1273–1279.
- Šrndić, N.; Laskov, P. Hidost: A Static Machine-Learning-Based Detector of Malicious Files. EURASIP J. Inf. Secur. 2016, 2016, 22.
- Wen, W.; Wang, Y.; Meng, Z. PDF file vulnerability detection. J. Tsinghua Univ. (Sci. Technol.) 2017, 57, 33–38.
- Falah, A.; Pan, L.; Huda, S.; Pokhrel, S.R.; Anwar, A. Improving malicious PDF classifier with feature engineering: A data-driven approach. Future Gener. Comput. Syst. 2021, 115, 314–326.
- Iwamoto, K.; Wasaki, K. A Method for Shellcode Extraction from Malicious Document Files Using Entropy and Emulation. Int. J. Eng. Technol. 2016, 8, 101–106.
- Xu, M.; Kim, T. PlatPal: Detecting malicious documents with platform diversity. In Proceedings of the USENIX Security Symposium, Vancouver, BC, Canada, 16–18 August 2017; pp. 271–287.
- Liu, D.; Wang, H.; Stavrou, A. Detecting malicious JavaScript in PDF through document instrumentation. In Proceedings of the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Atlanta, GA, USA, 23–26 June 2014; pp. 100–111.
- Yu, M.; Jiang, J.; Li, G.; Li, J.; Lou, C.; Liu, C.; Huang, W.; Wang, Y. A Unified Malicious Documents Detection Model Based on Two Layers of Abstraction. In Proceedings of the IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems, Zhangjiajie, China, 10–12 August 2019; pp. 2317–2323.
- PeePDF. Available online: https://github.com/jesparza/peepdf (accessed on 18 December 2022).
- PDFParser. Available online: https://github.com/smalot/pdfparser (accessed on 11 January 2023).
- PDFTear. Available online: https://github.com/Cryin/PDFTear (accessed on 1 December 2022).
- PDFRate. Available online: https://github.com/csmutz/pdfrate (accessed on 12 January 2023).
- Fernandez, F. Heuristic engines. In Proceedings of the 20th Virus Bulletin Conference, Prague, Czech Republic, 28–30 September 2022; pp. 407–444.
- Christmann, A.; Steinwart, I. On Robustness Properties of Convex Risk Minimization Methods for Pattern Recognition. J. Mach. Learn. Res. 2004, 5, 1007–1034.
- Google. Available online: https://www.google.com.hk/ (accessed on 2 December 2022).
- Yahoo. Available online: https://www.yahoo.com/ (accessed on 5 December 2022).
- React-pdf. Available online: https://github.com/wojtekmaj/react-pdf/ (accessed on 2 March 2023).
Number | Vulnerability | Dangerous Effects |
---|---|---|
CVE-2022-27787 | Buffer overflow | Arbitrary code execution |
CVE-2021-44709 | Buffer overflow | Arbitrary code execution |
CVE-2021-28564 | Buffer overflow | Arbitrary code execution |
CVE-2021-21017 | Buffer overflow | Arbitrary code execution |
CVE-2021-21045 | Improper access control | Privilege escalation attack |
CVE-2020-9704 | Buffer overflow | Arbitrary code execution |
CVE-2019-8249 | Logical flaw | Arbitrary code execution |
CVE-2019-8066 | Buffer overflow | Arbitrary code execution |
CVE-2018-19716 | Buffer overflow | Arbitrary code execution |
Tool | Features Included | Features Not Included |
---|---|---|
PeePDF | Content stream | Objects, physical structure, logical structure, metadata |
PDFParser | Objects, metadata | Physical structure, logical structure, content stream |
PDFTear | Content stream | Objects, physical structure, logical structure, metadata |
PDFRate | Content stream, metadata | Objects, physical structure, logical structure |
This paper | Objects, physical structure, logical structure, content stream, metadata | None |
Number | Vulnerability Causes | Hazard Impact |
---|---|---|
CVE-2022-34230 | UAF attack | Arbitrary code execution |
CVE-2022-27793 | Command Injection | Arbitrary code execution |
CVE-2022-27791 | Buffer overflow | Arbitrary code execution |
CVE-2021-44711 | Buffer overflow | Arbitrary code execution |
CVE-2021-44703 | Buffer overflow | Arbitrary code execution |
CVE-2021-39863 | Buffer overflow | Arbitrary code execution |
CVE-2019-8014 | Buffer overflow | Arbitrary code execution |
CVE-2017-16398 | UAF attack | Arbitrary code execution |
CVE-2011-0618 | Buffer overflow | Arbitrary code execution |
CVE-2010-2883 | Buffer overflow | Arbitrary code execution |
Type | Quantity < 100 KB | Quantity 100–1000 KB | Quantity > 1000 KB | Sum |
---|---|---|---|---|
Benign | 3918 | 1581 | 796 | 6295 |
Malicious | 3637 | 1125 | 733 | 5495 |
Malicious-high | 175 | 574 | 325 | 1074 |
Feature Attribute | Feature Category | Selected Number | Filtered Number | Filtered Proportion |
---|---|---|---|---|
Basic Features | Objects | 23 | 23 | 0.50 |
Basic Features | Physical structure | 14 | 5 | 0.26 |
Basic Features | Logical structure | 20 | 6 | 0.23 |
Basic Features | Content stream | 13 | 3 | 0.19 |
Basic Features | Metadata feature | 21 | 2 | 0.09 |
Dangerous Features | Basic attribute | 20 | 23 | 0.53 |
Dangerous Features | Redirection | 5 | 1 | 0.17 |
Dangerous Features | Suspicious keyword | 18 | 5 | 0.22 |
Dangerous Features | Confusion | 19 | 8 | 0.30 |
Model Category | Number | Model Name | Malicious Precision | Malicious Recall | Malicious Fscore | Malicious-High Precision | Malicious-High Recall | Malicious-High Fscore |
---|---|---|---|---|---|---|---|---|
Category 1 | (1) | Tra_RF | 0.897 | 0.869 | 0.874 | 0.818 | 0.846 | 0.835 |
Category 1 | (2) | Tra_SVM | 0.891 | 0.877 | 0.881 | 0.814 | 0.845 | 0.831 |
Category 2 | (3) | Imp_RF | 0.924 | 0.931 | 0.929 | 0.879 | 0.891 | 0.883 |
Category 2 | (4) | Imp_SVM | 0.926 | 0.939 | 0.933 | 0.875 | 0.889 | 0.881 |
Category 3 | (5) | Tra_AdaRF | 0.923 | 0.935 | 0.926 | 0.870 | 0.883 | 0.877 |
Category 3 | (6) | Tra_RobSVM | 0.920 | 0.937 | 0.929 | 0.862 | 0.881 | 0.873 |
Category 4 | (7) | Smutz | 0.903 | 0.899 | 0.901 | 0.822 | 0.841 | 0.836 |
Category 4 | (8) | Srndic | 0.917 | 0.909 | 0.912 | 0.836 | 0.859 | 0.847 |
Category 5 | (9) | Paper | 0.959 | 0.988 | 0.971 | 0.906 | 0.928 | 0.919 |
Training Phase | Time Consumption (s) |
---|---|
Basic feature extraction | 1973 |
Dangerous feature extraction | 517 |
EMBORAM | 1.82 |
Model optimization training | 3.79 |
Training Phase | Time Consumption (ms) |
---|---|
Basic feature extraction | 0.76 |
Dangerous feature extraction | 0.21 |
Model Detection | 0.33 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Song, E.; Hu, T.; Yi, P.; Wang, W. Double-Layer Detection Model of Malicious PDF Documents Based on Entropy Method with Multiple Features. Entropy 2023, 25, 1099. https://doi.org/10.3390/e25071099