Abstract
This paper aims to present several clustering methods based on rank distance. Rank distance has applications in many different fields such as computational linguistics, biology and computer science. The K-means algorithm represents each cluster by a single mean vector. The mean vector is computed with respect to a distance measure. Two K-means algorithms based on rank distance are described in this paper. Hierarchical clustering builds models based on distance connectivity. This paper describes two hierarchical clustering techniques that use rank distance. Experiments using mitochondrial DNA sequences extracted from several mammals are performed to compare the results of the clustering methods. Results demonstrate the clustering performance and the utility of the proposed algorithms.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Chimani M, Woste M, Bocker S (2011) A closer look at the closest string and closest substring problem. In: Proceedings of ALENEX, pp 13–24
de la Higuera C, Casacuberta F (2000) Topology of strings: median string is np-complete. Theor Comput Sci 230:39–48
Diaconis P, Graham RL (1977) Spearman footrule as a measure of disarray. J R Stat Soc Ser B (Methodological) 39(2):262–268
Dinu LP (2003) On the classification and aggregation of hierarchies with different constitutive elements. Fundamenta Informaticae 55(1):39–50
Dinu A, Dinu LP (2005) On the syllabic similarities of romance languages. In: Proceedings of CICLing 3406, pp 785–788
Dinu LP, Ionescu RT (2012) An efficient rank based approach for closest string and closest substring. PLoS One 7(6):e37576
Dinu LP, Ionescu RT (2012a) Clustering based on rank distance with applications on DNA. In: Proceedings of ICONIP 7667
Dinu LP, Ionescu RT (2012b) Clustering methods based on closest string via rank distance. In: Proceedings of SYNASC, pp 207–214
Dinu LP, Manea F (2006) An efficient approach for the rank aggregation problem. Theor Comput Sci 359(1–3):455–461
Dinu LP, Popa A (2012) On the closest string via rank distance. In: Proceedings of CPM 7354, pp 413–426
Dinu LP, Sgarro A (2006) A low-complexity distance for DNA strings. Fundamenta Informaticae 73(3):361–372
Frances M, Litman A (1997) On covering problems of codes. Theory Comput Syst 30(2):113–119
Huang Z (1998) Extensions to the K-means algorithm for clustering large data sets with categorical values. Data Min Knowl Discov 2(3):283–304
Kailing K, Kriegel HP, Kroger P (2004) Density-connected subspace clustering for high-dimensional data. In Proceedings of the 4th SIAM international conference on data mining
Koonin EV (1999) The emerging paradigm and open problems in comparative genomics. Bioinformatics 15:265–266
Lanctot KJ, Li M, Ma B, Wang S, Zhang L (2003) Distinguishing string selection problems. Inf Comput 185(1):41–55
Li M, Chen X, Li X, Ma B, Vitanyi PMB (2004) The similarity metric. IEEE Trans Inf Theory 50(12):3250–3264
Liew AW, Yan H, Yang M (2005) Pattern recognition techniques for the emerging field of bioinformatics: a review. Pattern Recognit 38(11):2055–2073
McCallum A, Nigam K, Ungar LH (2000) Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of ACM SIGKDD, pp 169–178
Nicolas F, Rivals E (2003) Complexities of centre and median string 2676:315–327
Nicolas F, Rivals E (2005) Hardness results for the center and median string problems under the weighted and unweighted edit distances. J Discret Algorithms 3(2–4):390–415
Palmer J, Herbon L (1988) Plant mitochondrial DNA evolves rapidly in structure, but slowly in sequence. J Mol Evolut 28:87–89
Popov YV (2007) Multiple genome rearrangement by swaps and by element duplications. Theor Comput Sci 385(1–3):115–126
Reyes A, Gissi C, Pesole G, Catzeflis FM, Saccone C (2000) Where do rodents fit? Evidence from the complete mitochondrial genome of Sciurus vulgaris. Mol Biol Evol 17(6):979–983
Selim SZ, Ismail MA (1984) K-means-type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Trans Pattern Anal Mach Intell PAMI 6(1):81–87
Smith T, Waterman M (1981) Comparison of biosequences. Adv Appl Math 2(4):482–489
States DJ, Agarwal P (1996) Compact encoding strategies for DNA sequence similarity search. In: Proceedings of the 4th international conference on intelligent systems for molecular biology, pp 211–217
Tian TZ, Ramakrishnan R, Livny M (1996) Birch: an efficient data clustering method for very large databases. SIGMOD Rec 25(2):103–114
Wooley JC (1999) Trends in computational biology: a summary based on a recomb plenary lecture. J Comput Biol 6:459–474
Yin C, Zhao X, Mu S, Tian S (2013) A fast multiclass classification algorithm based on cooperative clustering. Neural Process Lett 1–14. doi:10.1007/s11063-013-9278-9
Acknowledgments
The contribution of the authors to this paper is equal. Authors thank anonymous reviewers for helpful comments. The research of Liviu P. Dinu was supported by a grant of the Romanian National Authority for Scientific Research, CNCS UEFISCDI, project number PN-II-ID-PCE-2011-3-0959. Radu Tudor Ionescu thanks his Ph.D. supervisor Denis Enachescu from the University of Bucharest, for helpful discussions.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Dinu, L.P., Ionescu, R.T. Clustering based on median and closest string via rank distance with applications on DNA. Neural Comput & Applic 24, 77–84 (2014). https://doi.org/10.1007/s00521-013-1468-x
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00521-013-1468-x