DOI: 10.1145/3576915.3616589
research-article

Decoding the Secrets of Machine Learning in Malware Classification: A Deep Dive into Datasets, Feature Extraction, and Model Performance

Published: 21 November 2023

Abstract

Many studies have proposed machine-learning (ML) models for malware detection and classification, reporting almost-perfect performance. However, they assemble ground truth in different ways, use diverse static- and dynamic-analysis techniques for feature extraction, and even differ on what they consider a malware family. As a consequence, our community still lacks an understanding of malware classification results: whether they are tied to the nature and distribution of the collected dataset, to what extent the number of families and samples in the training dataset influences performance, and how well static and dynamic features complement each other.
This work sheds light on those open questions by investigating the impact of datasets, features, and classifiers on ML-based malware detection and classification. To this end, we collect the largest balanced malware dataset to date, with 67k samples from 670 families (100 samples each), and train state-of-the-art models for malware detection and family classification using our dataset. Our results reveal that static features perform better than dynamic features, and that combining both provides only a marginal improvement over static features alone. We find no correlation between packing and classification accuracy, and we find that missing behaviors in dynamically extracted features heavily penalize their performance. We also show that a larger number of families makes classification harder, while a larger number of samples per family increases accuracy. Finally, we find that models trained on a uniform distribution of samples per family generalize better to unseen data.
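The abstract does not describe how the balanced dataset was assembled, so the following is only an illustrative sketch of what "balanced" means here (670 families × 100 samples each = 67,000 samples). It assumes a hypothetical input of (sample_id, family) pairs; the function name, the input layout, and the choice to drop families with too few samples are assumptions made for illustration, not the authors' pipeline.

```python
import random
from collections import defaultdict


def balance_by_family(samples, per_family=100, seed=0):
    """Return {family: [sample_id, ...]} with exactly `per_family` ids each.

    `samples` is an iterable of (sample_id, family) pairs (hypothetical layout).
    Families with fewer than `per_family` samples are dropped (an assumption).
    """
    by_family = defaultdict(list)
    for sample_id, family in samples:
        by_family[family].append(sample_id)

    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    balanced = {}
    for family, ids in sorted(by_family.items()):
        if len(ids) >= per_family:
            # Uniform sampling without replacement within the family.
            balanced[family] = rng.sample(ids, per_family)
    return balanced
```

With such a helper, a uniform per-family distribution (the training setup the abstract reports as generalizing better) is obtained by construction, since every retained family contributes the same number of samples.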

    Published In

    CCS '23: Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security
    November 2023
    3722 pages
    ISBN: 9798400700507
    DOI: 10.1145/3576915
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. machine learning for malware
    2. malware detection
    3. malware family classification

    Qualifiers

    • Research-article

    Conference

    CCS '23

    Acceptance Rates

    Overall Acceptance Rate 1,261 of 6,999 submissions, 18%

    Cited By

    • (2024) A Comparison of Neural-Network-Based Intrusion Detection against Signature-Based Detection in IoT Networks. Information 15:3 (164). DOI: 10.3390/info15030164. Online publication date: 14-Mar-2024
    • (2024) How to Train your Antivirus: RL-based Hardening through the Problem Space. Proceedings of the 27th International Symposium on Research in Attacks, Intrusions and Defenses (130-146). DOI: 10.1145/3678890.3678912. Online publication date: 30-Sep-2024
    • (2024) AgraBOT: Accelerating Third-Party Security Risk Management in Enterprise Setting through Generative AI. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (74-79). DOI: 10.1145/3663529.3663829. Online publication date: 10-Jul-2024
    • (2024) Enhancing Malware Classification via Self-Similarity Techniques. IEEE Transactions on Information Forensics and Security 19 (7232-7244). DOI: 10.1109/TIFS.2024.3433372. Online publication date: 25-Jul-2024
    • (2024) Enhancing Android Malware Detection Through Machine Learning: Insights From Permission and Metadata Analysis. 2024 Third International Conference on Smart Technologies and Systems for Next Generation Computing (ICSTSN) (1-6). DOI: 10.1109/ICSTSN61422.2024.10670984. Online publication date: 18-Jul-2024
    • (2024) Assessing Static and Dynamic Features for Packing Detection. The Combined Power of Research, Education, and Dissemination (146-166). DOI: 10.1007/978-3-031-73887-6_12. Online publication date: 23-Oct-2024
    • (2023) Enhancing Machine Learning in Information Security: Power-Law Distribution and Dragon King. 2023 International Conference on Computer Science and Automation Technology (CSAT) (324-327). DOI: 10.1109/CSAT61646.2023.00088. Online publication date: 6-Oct-2023
