Clustering based on median and closest string via rank distance with applications on DNA

Liviu P. Dinu¹ &
Radu Tudor Ionescu¹

Abstract

This paper presents several clustering methods based on rank distance, a measure with applications in fields such as computational linguistics, biology and computer science. The K-means algorithm represents each cluster by a single mean vector, computed with respect to a distance measure; two K-means algorithms based on rank distance are described. Hierarchical clustering builds models based on distance connectivity; two hierarchical clustering techniques that use rank distance are also described. Experiments on mitochondrial DNA sequences extracted from several mammals compare the results of the clustering methods. The results demonstrate the clustering performance and the utility of the proposed algorithms.
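The abstract does not define rank distance, so the following is only a minimal sketch under an assumed definition: each character is indexed by its occurrence number (for example, `abbab` becomes a1 b1 b2 a2 b3), and the distance sums the absolute position differences of indexed characters shared by both strings plus the positions of indexed characters that appear in only one string. The function names `annotate`, `rank_distance` and `assign_to_nearest` are hypothetical and are not taken from the paper; the assignment step shows only the generic K-means idea of attaching each sequence to its nearest cluster representative, not the authors' full algorithms.

```python
from collections import defaultdict


def annotate(s):
    """Map each indexed character (char, k-th occurrence) to its 1-based position."""
    counts = defaultdict(int)
    positions = {}
    for pos, ch in enumerate(s, start=1):
        counts[ch] += 1
        positions[(ch, counts[ch])] = pos
    return positions


def rank_distance(u, v):
    """Rank distance between two strings under the assumed definition above."""
    pu, pv = annotate(u), annotate(v)
    total = 0
    for key in pu.keys() | pv.keys():
        if key in pu and key in pv:
            # Shared indexed character: absolute difference of positions.
            total += abs(pu[key] - pv[key])
        else:
            # Present in only one string: add its position in that string.
            total += pu.get(key, 0) + pv.get(key, 0)
    return total


def assign_to_nearest(sequences, representatives):
    """Assignment step: index of the closest representative for each sequence."""
    return [
        min(range(len(representatives)),
            key=lambda j: rank_distance(seq, representatives[j]))
        for seq in sequences
    ]


if __name__ == "__main__":
    print(rank_distance("abbab", "abbbac"))
    print(assign_to_nearest(["AACC", "GGTT"], ["AACG", "GTTT"]))
```

In a full K-means loop, the representatives would be recomputed after each assignment step; how that is done (for instance via a median string or a closest string, as the title suggests) is not specified in this abstract and is left out of the sketch.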

Acknowledgments

The authors contributed equally to this paper. The authors thank the anonymous reviewers for their helpful comments. The research of Liviu P. Dinu was supported by a grant of the Romanian National Authority for Scientific Research, CNCS UEFISCDI, project number PN-II-ID-PCE-2011-3-0959. Radu Tudor Ionescu thanks his Ph.D. supervisor, Denis Enachescu from the University of Bucharest, for helpful discussions.

Author information

Authors and Affiliations

Faculty of Mathematics and Computer Science, University of Bucharest, No. 14 Academiei Street, Bucharest, Romania
Liviu P. Dinu & Radu Tudor Ionescu

Corresponding author

Correspondence to Radu Tudor Ionescu.

About this article

Cite this article

Dinu, L.P., Ionescu, R.T. Clustering based on median and closest string via rank distance with applications on DNA. Neural Comput & Applic 24, 77–84 (2014). https://doi.org/10.1007/s00521-013-1468-x

Received: 05 March 2013
Accepted: 15 July 2013
Published: 03 August 2013
Issue Date: January 2014
DOI: https://doi.org/10.1007/s00521-013-1468-x
