Abstract
Clustering algorithms for mixed data have been emerging slowly since the end of the last century. One of the most recent algorithms proposed for this data type is KAMILA (KAy-means for MIxed LArge data). Although KAMILA has outperformed earlier mixed-data algorithms, it has some gaps. One of them is the definition of the numerical and categorical variable weights, which are either set by the user or, by default, equal to one for all features. We therefore propose an optimization algorithm, the Biased Random-Key Genetic Algorithm for Features Weighting (BRKGAFW), to weight the numerical and categorical variables in KAMILA. The experiments used six real-world mixed data sets and two baselines for comparison: KAMILA with the default weights, and KAMILA with weights set by a traditional genetic algorithm. The results show that the proposed algorithm outperformed both baselines on all data sets.
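To make the idea of evolving feature weights concrete, the following is a minimal sketch of a generic biased random-key genetic algorithm in Python. It is not the authors' BRKGAFW implementation: the parameter values, the function names `brkga` and `toy_fitness`, and the `TARGET` vector are all hypothetical. In particular, `toy_fitness` is only a stand-in for an objective that would run KAMILA with the candidate weights and score the resulting partition.

```python
import random

def brkga(fitness, n_keys, pop_size=30, elite_frac=0.2, mutant_frac=0.1,
          rho_e=0.7, generations=50, seed=0):
    """Maximize `fitness` over vectors of `n_keys` random keys in [0, 1)."""
    rng = random.Random(seed)
    n_elite = max(1, int(elite_frac * pop_size))
    n_mutant = max(1, int(mutant_frac * pop_size))
    pop = [[rng.random() for _ in range(n_keys)] for _ in range(pop_size)]
    for _ in range(generations):
        # Rank the population; the best individuals form the elite set.
        pop.sort(key=fitness, reverse=True)
        elite, non_elite = pop[:n_elite], pop[n_elite:]
        # Mutants are brand-new random-key vectors, not perturbed copies.
        mutants = [[rng.random() for _ in range(n_keys)]
                   for _ in range(n_mutant)]
        # Biased uniform crossover: one elite and one non-elite parent,
        # each gene taken from the elite parent with probability rho_e.
        children = []
        for _ in range(pop_size - n_elite - n_mutant):
            e, o = rng.choice(elite), rng.choice(non_elite)
            children.append([ge if rng.random() < rho_e else go
                             for ge, go in zip(e, o)])
        # The elite set is copied unchanged into the next generation.
        pop = elite + mutants + children
    return max(pop, key=fitness)

# Hypothetical stand-in objective: closeness to an arbitrary target vector.
# A real objective would cluster the data with these weights and return
# a clustering quality score.
TARGET = [0.9, 0.2, 0.5]

def toy_fitness(weights):
    return -sum((w - t) ** 2 for w, t in zip(weights, TARGET))

best = brkga(toy_fitness, n_keys=3)
```

Here each random key is read directly as one feature weight in [0, 1), which is the simplest possible decoder. A different decoder, for example one that rescales the keys to another range, would only change how `fitness` interprets the vector.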
This work was supported by the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Brazil, under grant numbers 306075/2017-2 and 430137/2018-4, and by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Brazil.
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Martarelli, N.J., Nagano, M.S. (2019). Optimization of the Numeric and Categorical Attribute Weights in KAMILA Mixed Data Clustering Algorithm. In: Yin, H., Camacho, D., Tino, P., Tallón-Ballesteros, A., Menezes, R., Allmendinger, R. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2019. IDEAL 2019. Lecture Notes in Computer Science, vol 11871. Springer, Cham. https://doi.org/10.1007/978-3-030-33607-3_3
DOI: https://doi.org/10.1007/978-3-030-33607-3_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-33606-6
Online ISBN: 978-3-030-33607-3
eBook Packages: Computer Science, Computer Science (R0)