research-article

Decoding the Secrets of Machine Learning in Malware Classification: A Deep Dive into Datasets, Feature Extraction, and Model Performance

Authors:

Savino Dambra, Yufei Han, Simone Aonzo, Platon Kotzias, Antonino Vitale, Juan Caballero, Davide Balzarotti, Leyla Bilge

CCS '23: Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security

Pages 60 - 74

https://doi.org/10.1145/3576915.3616589

Published: 21 November 2023

Abstract

Many studies have proposed machine-learning (ML) models for malware detection and classification, reporting almost-perfect performance. However, they assemble ground truth in different ways, use diverse static- and dynamic-analysis techniques for feature extraction, and even differ on what they consider a malware family. As a consequence, our community still lacks an understanding of malware classification results: whether they are tied to the nature and distribution of the collected dataset, to what extent the number of families and samples in the training dataset influences performance, and how well static and dynamic features complement each other.

This work sheds light on those open questions by investigating the impact of datasets, features, and classifiers on ML-based malware detection and classification. To this end, we collect the largest balanced malware dataset so far, with 67k samples from 670 families (100 samples each), and train state-of-the-art models for malware detection and family classification on it. Our results reveal that static features perform better than dynamic features, and that combining both provides only marginal improvement over static features alone. We find no correlation between packing and classification accuracy, and we find that behaviors missing from dynamically extracted features heavily penalize their performance. We also show that a larger number of families makes classification harder, while a larger number of samples per family increases accuracy. Finally, we find that models trained on a uniform distribution of samples per family generalize better to unseen data.
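
To make the notion of a balanced dataset concrete, the following is a minimal Python sketch of one way to keep the same number of samples for every family. It is an illustration under stated assumptions, not the authors' pipeline: the function name `balance_by_family`, the `(sample_id, family)` input format, the choice to drop families that have too few samples, and the toy family names are all hypothetical. With 670 families and 100 samples each, such a procedure would yield 670 x 100 = 67,000 samples, matching the dataset size quoted above.

```python
import random
from collections import defaultdict


def balance_by_family(labeled_samples, per_family=100, seed=0):
    """Return a dict mapping family -> list of exactly `per_family` sample ids.

    Hypothetical helper: families with fewer than `per_family` samples are
    dropped, so every remaining family contributes the same number of samples.
    """
    by_family = defaultdict(list)
    for sample_id, family in labeled_samples:
        by_family[family].append(sample_id)

    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    balanced = {}
    for family, ids in sorted(by_family.items()):
        if len(ids) >= per_family:
            # Sample without replacement to downsample oversized families.
            balanced[family] = rng.sample(ids, per_family)
    return balanced


# Synthetic data: three made-up families of different sizes.
toy = (
    [(f"a{i}", "family_a") for i in range(250)]
    + [(f"b{i}", "family_b") for i in range(100)]
    + [(f"c{i}", "family_c") for i in range(40)]
)
balanced = balance_by_family(toy, per_family=100)
total = sum(len(ids) for ids in balanced.values())
print(sorted(balanced), total)
```

In this toy run, `family_c` is dropped because it has only 40 samples, and the two remaining families contribute 100 samples each. Whether to drop or oversample small families is a design choice; the sketch drops them so that no sample is duplicated.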

Index Terms

1. Security and privacy
  1. Intrusion/anomaly detection and malware mitigation
    1. Malware and its mitigation

Published In

CCS '23: Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security

November 2023

3722 pages

ISBN:9798400700507

DOI:10.1145/3576915

General Chairs: Weizhi Meng (Technical University of Denmark), Christian D. Jensen (Technical University of Denmark)

Program Chairs: Cas Cremers (CISPA Helmholtz Center for Information Security), Engin Kirda (Khoury College of Computer Sciences)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGSAC: ACM Special Interest Group on Security, Audit, and Control

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Ministerio de Ciencia, Innovación y Universidades
Agence Nationale de la Recherche
European Research Council

Conference

CCS '23

Sponsor:

SIGSAC

CCS '23: ACM SIGSAC Conference on Computer and Communications Security

November 26 - 30, 2023

Copenhagen, Denmark

Acceptance Rates

Overall Acceptance Rate 1,261 of 6,999 submissions, 18%
