Abstract
k-nearest neighbors (kNN) outlier detection is a simple yet effective and widely used method in data mining. Applying it directly to big data is infeasible because of its time and memory requirements. Several distributed alternatives based on MapReduce have been proposed to enable the method to handle large-scale data, but their performance can be improved further with designs that fit newer technologies. Furthermore, the kNN method gives every attribute the same importance when identifying outliers. Several approaches aim to enhance its precision, and entropy-based outlier detection is among the most successful. Entropy-based outlier detection computes the entropy of each attribute of the data set and uses it to weight the distance formula used for outlier detection. Although kNN outlier detection methods exist for big data sets, no entropy-based outlier detection method can manage that volume of data. In this paper, we propose an entropy-based outlier detection method built on Spark. It consists of three separate stages. The first stage computes the attribute entropy. The second stage finds the k nearest neighbors of each point and calculates its outlier degree using the attribute entropy computed previously. The third stage ranks the points by outlier degree and declares the top n points in this ranking to be outliers. Extensive experimental results show the advantages of the proposed method: it improves outlier detection precision, reduces runtime, and makes effective outlier detection on large-scale data sets practical.
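The three stages described above can be illustrated with a minimal single-machine sketch. It is not the Spark implementation proposed in the paper, and several details are assumptions made only for illustration because the abstract does not specify them: entropy is estimated from an equal-width histogram with a hypothetical `bins` parameter, attribute weights are the entropies normalized to sum to 1, and the outlier degree of a point is taken as the mean weighted Euclidean distance to its k nearest neighbors. All function names are hypothetical.

```python
import math
from collections import Counter


def attribute_entropies(rows, bins=4):
    """Stage 1: Shannon entropy of each attribute (equal-width histogram; an assumption)."""
    entropies = []
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        lo, hi = min(col), max(col)
        width = (hi - lo) / bins or 1.0  # constant column: avoid division by zero
        counts = Counter(min(int((v - lo) / width), bins - 1) for v in col)
        total = len(col)
        entropies.append(-sum(c / total * math.log2(c / total) for c in counts.values()))
    return entropies


def outlier_degrees(rows, weights, k):
    """Stage 2: mean entropy-weighted distance to the k nearest neighbors (an assumption)."""
    degrees = []
    for i, p in enumerate(rows):
        dists = sorted(
            math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, p, q)))
            for j, q in enumerate(rows)
            if j != i
        )
        degrees.append(sum(dists[:k]) / k)
    return degrees


def top_n_outliers(rows, k=2, n=1, bins=4):
    """Stage 3: rank points by outlier degree and return the indices of the top n."""
    h = attribute_entropies(rows, bins)
    total = sum(h)
    # Normalized entropies as weights; fall back to uniform weights if all are zero.
    weights = [x / total for x in h] if total > 0 else [1 / len(h)] * len(h)
    degrees = outlier_degrees(rows, weights, k)
    ranked = sorted(range(len(rows)), key=lambda i: degrees[i], reverse=True)
    return ranked[:n]


points = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (1.2, 1.1), (8.0, 9.0)]
print(top_n_outliers(points, k=2, n=1))  # index of the isolated point
```

The pairwise distance loop is quadratic in the number of points, which is the time and memory cost that motivates a distributed design; in a Spark setting each stage would instead be expressed over partitioned data.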
Acknowledgements
This work was supported by the Civil Aviation Flight Data Analysis under No. XM2852 and Key Scientific and Technological Research Projects in Henan Province (Grant No. 192102210125).
Cite this article
Feng, G., Li, Z., Zhou, W. et al. Entropy-based outlier detection using spark. Cluster Comput 23, 409–419 (2020). https://doi.org/10.1007/s10586-019-02932-2
DOI: https://doi.org/10.1007/s10586-019-02932-2