Abstract
Discretizing continuous attributes is an important pre-processing step in machine learning and data mining, and doing so efficiently on massive data has become an urgent problem. Hadoop, which has risen to prominence in recent years, can efficiently support many applications that process massive data. This paper designs and implements a parallel Chi2-based discretization algorithm on the MapReduce model. To evaluate discretization efficiency, experiments were run on data sets of different sizes and on clusters with different numbers of nodes. The experimental results show that the proposed algorithm is efficient and scales well when discretizing the continuous attributes of massive data.
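As a rough illustration of how a Chi2-style discretization can be split into map and reduce steps, the following single-process Python sketch counts class labels per attribute value and then computes the Pearson chi-square statistic for two adjacent intervals, as used in ChiMerge and Chi2. It is not the authors' implementation: the function names, the key layout `(attribute_index, value, label)`, and the choice to skip cells whose expected count is zero are all assumptions made for this example, and no Hadoop API is involved.

```python
from collections import Counter, defaultdict


def map_record(record, label):
    """Map step (hypothetical): emit ((attribute_index, value, label), 1) per attribute."""
    for attr_index, value in enumerate(record):
        yield (attr_index, value, label), 1


def reduce_counts(pairs):
    """Reduce step (hypothetical): sum the counts emitted for each key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


def interval_class_counts(totals, attr_index):
    """Group the reduced counts of one attribute into {value: {label: count}}, sorted by value."""
    table = defaultdict(Counter)
    for (a, value, label), count in totals.items():
        if a == attr_index:
            table[value][label] += count
    return dict(sorted(table.items()))


def chi2_adjacent(counts_a, counts_b, classes):
    """Pearson chi-square statistic for the class counts of two adjacent intervals."""
    rows = [counts_a, counts_b]
    row_totals = [sum(r.get(c, 0) for c in classes) for r in rows]
    col_totals = {c: sum(r.get(c, 0) for r in rows) for c in classes}
    n = sum(row_totals)
    chi2 = 0.0
    for r, row_total in zip(rows, row_totals):
        for c in classes:
            expected = row_total * col_totals[c] / n
            if expected == 0:
                # Simplification for this sketch: an empty cell contributes nothing.
                continue
            chi2 += (r.get(c, 0) - expected) ** 2 / expected
    return chi2


# Toy data set with one continuous attribute and two classes.
records = [(1.0,), (1.0,), (2.0,), (2.0,)]
labels = ["a", "a", "b", "b"]

emitted = [pair for rec, lab in zip(records, labels) for pair in map_record(rec, lab)]
totals = reduce_counts(emitted)
intervals = interval_class_counts(totals, attr_index=0)
statistic = chi2_adjacent(intervals[1.0], intervals[2.0], classes=["a", "b"])
```

In a Chi2-style procedure, adjacent intervals with a low statistic would be merged because their class distributions are similar. Here the two intervals have completely different class distributions, so the statistic is high and they would stay separate.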
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Zhang, Y., Yu, J., Wang, J. (2015). Parallel Implementation of Chi2 Algorithm in MapReduce Framework. In: Zu, Q., Hu, B., Gu, N., Seng, S. (eds) Human Centered Computing. HCC 2014. Lecture Notes in Computer Science, vol 8944. Springer, Cham. https://doi.org/10.1007/978-3-319-15554-8_83
DOI: https://doi.org/10.1007/978-3-319-15554-8_83
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-15553-1
Online ISBN: 978-3-319-15554-8
eBook Packages: Computer Science, Computer Science (R0)