Min-sum Clustering of Protein Sequences with Limited Distance Information

Konstantin Voevodski,
Maria-Florina Balcan,
Heiko Röglin,
Shang-Hua Teng &
Yu Xia

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 7005)

Included in the following conference series:

International Workshop on Similarity-Based Pattern Recognition

Abstract

We study the problem of efficiently clustering protein sequences in a limited-information setting. We assume that the distances between sequences are not known in advance and must be queried during the execution of the algorithm. Our goal is to find an accurate clustering using few queries. We model the problem as a point set S with an unknown metric d on S, and assume access to one-versus-all distance queries that, given a point s ∈ S, return the distances between s and all other points. A one-versus-all query models an efficient sequence database search program such as BLAST, which compares an input sequence to an entire data set. Under a natural assumption about the approximation stability of the min-sum objective function for clustering, we design a provably accurate clustering algorithm that uses few one-versus-all queries. In our empirical study, our method compares favorably to well-established clustering algorithms when computationally derived clusterings are evaluated against gold-standard manual classifications.
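
To make the query model concrete, the following is a minimal Python sketch of a one-versus-all oracle and of the min-sum objective. It is an illustration under stated assumptions, not the algorithm from the paper: the class name `OneVersusAllOracle`, the function `min_sum_cost`, and the toy distance on integers are all hypothetical. The sketch also assumes the common convention in which the min-sum cost adds d(x, y) over all ordered pairs within each cluster.

```python
from typing import Callable, Dict, Hashable, Iterable, List


class OneVersusAllOracle:
    """Hypothetical wrapper around a hidden metric d on a point set S.

    Each call to query(s) stands in for one database search (for
    example, one BLAST run) and is counted, so that an algorithm's
    query usage can be measured.
    """

    def __init__(self, points: Iterable[Hashable],
                 dist: Callable[[Hashable, Hashable], float]) -> None:
        self._points = list(points)
        self._dist = dist
        self.num_queries = 0

    def query(self, s: Hashable) -> Dict[Hashable, float]:
        """Return the distances between s and all other points."""
        self.num_queries += 1
        return {t: self._dist(s, t) for t in self._points if t != s}


def min_sum_cost(clusters: List[List[Hashable]],
                 dist: Callable[[Hashable, Hashable], float]) -> float:
    """Min-sum objective: the sum of d(x, y) over all ordered pairs
    (x, y) that lie in the same cluster (an assumed convention)."""
    total = 0.0
    for cluster in clusters:
        for x in cluster:
            for y in cluster:
                total += dist(x, y)
    return total


# Toy example: points on a line, with absolute difference as the metric.
points = [0, 1, 2, 10, 11]
toy_dist = lambda a, b: float(abs(a - b))

oracle = OneVersusAllOracle(points, toy_dist)
row = oracle.query(0)  # one query returns a whole row of distances
cost = min_sum_cost([[0, 1, 2], [10, 11]], toy_dist)
```

A full distance matrix would need one such query per point; the setting described above asks for an accurate clustering from far fewer rows, which is why the oracle keeps a query counter.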

Author information

Authors and Affiliations

Department of Computer Science, Boston University, Boston, MA, 02215, USA
Konstantin Voevodski
College of Computing, Georgia Institute of Technology, Atlanta, GA, 30332, USA
Maria-Florina Balcan
Department of Computer Science, University of Bonn, Bonn, Germany
Heiko Röglin
Computer Science Department, University of Southern California, Los Angeles, CA, 90089, USA
Shang-Hua Teng
Bioinformatics Program and Department of Chemistry, Boston University, Boston, MA, 02215, USA
Yu Xia

Editor information

Editors and Affiliations

DAIS, Università Ca’ Foscari, Via Torino 155, 30172, Venice, Italy
Marcello Pelillo
The University of York, YO1 5DD, Heslington, York, UK
Edwin R. Hancock

About this paper

Cite this paper

Voevodski, K., Balcan, MF., Röglin, H., Teng, SH., Xia, Y. (2011). Min-sum Clustering of Protein Sequences with Limited Distance Information. In: Pelillo, M., Hancock, E.R. (eds) Similarity-Based Pattern Recognition. SIMBAD 2011. Lecture Notes in Computer Science, vol 7005. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24471-1_14

DOI: https://doi.org/10.1007/978-3-642-24471-1_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24470-4
Online ISBN: 978-3-642-24471-1
eBook Packages: Computer Science, Computer Science (R0)
