research-article

Benchmarking of synthetic network data: Reviewing challenges and approaches

Published: 01 October 2024

Abstract

The development of Network Intrusion Detection Systems (NIDS) requires labeled network traffic, especially to train and evaluate machine learning approaches. Besides recording traffic, generating traffic via generative models is a promising way to obtain vast amounts of labeled data. Various machine learning approaches for data generation exist, but assessing the quality of the generated data is complex and not standardized. The lack of common quality criteria complicates the comparison of both synthetic data generation approaches and the synthetic data they produce.
Our work addresses this gap in multiple steps. Firstly, we review and categorize existing approaches for evaluating synthetic data in the network traffic domain as well as in other data domains. Secondly, based on our review, we compile a set of metrics that are suitable for the NetFlow domain, which we aggregate into two metrics: the Data Dissimilarity Score and the Domain Dissimilarity Score. Thirdly, we evaluate the proposed metrics on real-world data sets to demonstrate their ability to distinguish between samples from different data sets. As a final step, we conduct a case study to demonstrate the application of the metrics to the evaluation of synthetic data. We calculate the metrics on samples from real NetFlow data sets to define an upper and lower bound for inter- and intra-data set similarity scores. Afterward, we generate synthetic data via a Generative Adversarial Network (GAN) and Generative Pre-trained Transformer 2 (GPT-2), apply the metrics to these synthetic data, and incorporate the lower-bound baseline results to obtain an objective benchmark. The application of the benchmarking process is demonstrated on three NetFlow benchmark data sets: NF-CSE-CIC-IDS2018, NF-ToN-IoT and NF-UNSW-NB15. Our demonstration indicates that this benchmark framework captures the differences in similarity between real-world data and synthetic data of varying quality well, and can therefore be used to assess the quality of generated synthetic data.
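
The baseline idea described above (intra-data set scores as one bound, inter-data set scores as the other, with synthetic data placed in between) can be illustrated with a minimal sketch. This is not the paper's Data Dissimilarity Score or Domain Dissimilarity Score, which the abstract does not define. As assumptions for illustration only, the sketch uses a mean per-feature Jensen-Shannon distance over histograms as a stand-in score, hypothetical helper names (`js_distance`, `dissimilarity`), and random Gaussian arrays in place of NetFlow features.

```python
import numpy as np


def js_distance(p, q):
    """Jensen-Shannon distance (log base 2, range 0..1) between two histograms."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 wherever a > 0.
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return float(np.sqrt(max(divergence, 0.0)))


def dissimilarity(a, b, bins=20):
    """Stand-in score (assumption): mean per-feature JS distance.

    a, b: 2-D arrays with one row per flow and one column per numeric feature.
    """
    scores = []
    for j in range(a.shape[1]):
        lo = min(a[:, j].min(), b[:, j].min())
        hi = max(a[:, j].max(), b[:, j].max())
        if lo == hi:
            hi = lo + 1.0  # avoid a zero-width histogram range
        edges = np.linspace(lo, hi, bins + 1)
        hist_a, _ = np.histogram(a[:, j], bins=edges)
        hist_b, _ = np.histogram(b[:, j], bins=edges)
        scores.append(js_distance(hist_a, hist_b))
    return float(np.mean(scores))


# Random stand-ins (assumption): no real NetFlow data is used here.
rng = np.random.default_rng(0)
real_a = rng.normal(0.0, 1.0, size=(4000, 3))     # "data set A"
real_b = rng.normal(3.0, 2.0, size=(4000, 3))     # "data set B"
synthetic = rng.normal(0.5, 1.2, size=(2000, 3))  # imperfect "generator" output

intra = dissimilarity(real_a[:2000], real_a[2000:])  # two samples, same data set
inter = dissimilarity(real_a[:2000], real_b[:2000])  # samples, different data sets
synth = dissimilarity(real_a[:2000], synthetic)      # real vs. synthetic

print(f"intra={intra:.3f}  synthetic={synth:.3f}  inter={inter:.3f}")
```

In this toy setup the synthetic score falls between the intra- and inter-data set scores, which mirrors how the baselines are meant to put a synthetic data score in context.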

Information & Contributors

Information

Published In

Computers and Security, Volume 145, Issue C
Oct 2024
444 pages

Publisher

Elsevier Advanced Technology Publications

United Kingdom

Author Tags

  1. NetFlow
  2. Synthetic data
  3. Generator
  4. GPT
  5. GAN
  6. Benchmark
  7. Evaluation

Qualifiers

  • Research-article
