Missing and Incomplete Data Handling in Cybersecurity Applications

Marek Pawlicki, Michał Choraś, Rafał Kozik & Witold Hołubowicz

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12672)

Included in the following conference series:

Asian Conference on Intelligent Information and Database Systems

Abstract

Data plays a critical role in intelligent information systems, so preparing data for use by artificial intelligence is a substantial step in the processing pipeline. Sometimes modest improvements in data quality translate into a vastly superior model. Missing data is one of the most common problems in data collected in the real world, and it stems directly from the nature of data collection. This paper considers the handling of missing values in a real-world application of computational intelligence. Six different approaches to missing values are evaluated, and their influence on the results of the Random Forest algorithm trained on the CICIDS2017 intrusion detection benchmark dataset is assessed. The experiments show that the choice of data imputation algorithm has a severe impact on the results of the classifier used for network intrusion detection. They also show that one of the most popular approaches to handling missing data, complete case analysis, should never be used in cybersecurity.
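To make the terminology concrete, the following is a minimal sketch contrasting complete case analysis (dropping every record that has a missing value) with simple mean imputation. It is not the authors' experimental setup: the toy records, the feature names, and the helper functions complete_case and mean_impute are hypothetical, and mean imputation is used only as a familiar stand-in, since the abstract does not name the six approaches that were evaluated.

```python
from statistics import mean

# Hypothetical toy flow records: (duration_s, bytes_sent, label).
# None marks a missing feature value. These are NOT CICIDS2017 data.
records = [
    (1.2, 300.0, "benign"),
    (None, 4500.0, "attack"),
    (0.4, None, "benign"),
    (3.1, 5200.0, "attack"),
    (0.9, 250.0, "benign"),
]


def complete_case(rows):
    """Complete case analysis: keep only rows with no missing feature."""
    return [r for r in rows if all(v is not None for v in r[:-1])]


def mean_impute(rows):
    """Replace each missing feature with the mean of that column's observed values."""
    n_features = len(rows[0]) - 1
    col_means = [
        mean(r[j] for r in rows if r[j] is not None) for j in range(n_features)
    ]
    return [
        tuple(col_means[j] if r[j] is None else r[j] for j in range(n_features))
        + (r[-1],)
        for r in rows
    ]


kept = complete_case(records)
imputed = mean_impute(records)

# Complete case analysis shrinks the dataset; imputation keeps every record.
print(len(records), len(kept), len(imputed))
```

In this toy example, complete case analysis discards two of the five records, one of them labelled as an attack, while imputation retains all five at the cost of introducing estimated values. Either result could then be passed to a classifier such as Random Forest to compare the effect on detection results.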

Acknowledgement

This work is funded under the PREVISION project, which has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 833115.

Author information

Authors and Affiliations

ITTI Sp. Z O.O., Poznań, Poland
Marek Pawlicki, Michał Choraś & Rafał Kozik
UTP University of Science and Technology, Bydgoszcz, Poland
Marek Pawlicki, Michał Choraś, Rafał Kozik & Witold Hołubowicz

Corresponding author

Correspondence to Marek Pawlicki.

Editor information

Editors and Affiliations

Wrocław University of Science and Technology, Wrocław, Poland
Ngoc Thanh Nguyen
King Mongkut's Institute of Technology Ladkrabang, Bangkok, Thailand
Suphamit Chittayasothorn
Nanyang Technological University, Singapore, Singapore
Dusit Niyato
Wrocław University of Science and Technology, Wrocław, Poland
Bogdan Trawiński

About this paper

Cite this paper

Pawlicki, M., Choraś, M., Kozik, R., Hołubowicz, W. (2021). Missing and Incomplete Data Handling in Cybersecurity Applications. In: Nguyen, N.T., Chittayasothorn, S., Niyato, D., Trawiński, B. (eds) Intelligent Information and Database Systems. ACIIDS 2021. Lecture Notes in Computer Science, vol 12672. Springer, Cham. https://doi.org/10.1007/978-3-030-73280-6_33

DOI: https://doi.org/10.1007/978-3-030-73280-6_33
Published: 05 April 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-73279-0
Online ISBN: 978-3-030-73280-6
eBook Packages: Computer Science (R0)
