Abstract
Data plays a critical role in intelligent information systems, so preparing data for use by artificial intelligence is a substantial step in the processing pipeline. Sometimes, modest improvements in data quality translate into a vastly superior model. Missing data is one of the most common problems in data collected in the real world, and it stems directly from the nature of data collection. This paper considers the handling of missing values in a real-world application of computational intelligence. Six different approaches to missing values are evaluated, and their influence on the results of the Random Forest algorithm trained on the CICIDS2017 intrusion detection benchmark dataset is assessed. The experiments show that the choice of data imputation algorithm has a severe impact on the results of the classifier used for network intrusion detection. They also indicate that one of the most popular approaches to handling missing data, complete case analysis, should never be used in cybersecurity.
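The kind of comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' experimental setup: it uses synthetic data from scikit-learn instead of CICIDS2017, assumes values go missing completely at random, and contrasts only two strategies, complete case analysis and mean imputation (the latter chosen here purely for illustration). All variable and function names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic, imbalanced stand-in for network-flow features (NOT CICIDS2017).
X, y = make_classification(n_samples=2000, n_features=10, n_informative=6,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Assumption: 20% of training cells are missing completely at random.
X_missing = X_train.copy()
X_missing[rng.random(X_missing.shape) < 0.2] = np.nan


def fit_and_score(X_tr, y_tr):
    """Train a Random Forest and score it on the untouched test split."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_tr, y_tr)
    return balanced_accuracy_score(y_test, clf.predict(X_test))


# Strategy 1: complete case analysis, i.e. drop every row with a missing cell.
complete = ~np.isnan(X_missing).any(axis=1)
rows_kept = int(complete.sum())
score_cca = fit_and_score(X_missing[complete], y_train[complete])

# Strategy 2: mean imputation, i.e. fill each gap with the column mean.
X_mean = SimpleImputer(strategy="mean").fit_transform(X_missing)
score_mean = fit_and_score(X_mean, y_train)

print(f"complete case analysis: {rows_kept}/{len(X_train)} rows kept, "
      f"balanced accuracy {score_cca:.3f}")
print(f"mean imputation: {len(X_train)}/{len(X_train)} rows kept, "
      f"balanced accuracy {score_mean:.3f}")
```

Under these assumptions, complete case analysis discards most of the training rows, because a row survives only if all ten of its features are present. That shrinks the already small minority class, which is one plausible reason the strategy can hurt an intrusion detection classifier.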
Acknowledgement
This work is funded under the PREVISION project, which has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 833115.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Pawlicki, M., Choraś, M., Kozik, R., Hołubowicz, W. (2021). Missing and Incomplete Data Handling in Cybersecurity Applications. In: Nguyen, N.T., Chittayasothorn, S., Niyato, D., Trawiński, B. (eds) Intelligent Information and Database Systems. ACIIDS 2021. Lecture Notes in Computer Science, vol 12672. Springer, Cham. https://doi.org/10.1007/978-3-030-73280-6_33
DOI: https://doi.org/10.1007/978-3-030-73280-6_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-73279-0
Online ISBN: 978-3-030-73280-6
eBook Packages: Computer Science, Computer Science (R0)