Abstract
We study the problem of efficiently clustering protein sequences in a limited-information setting. We assume that the distances between the sequences are not known in advance and must be queried during the execution of the algorithm. Our goal is to find an accurate clustering using few queries. We model the problem as a point set S with an unknown metric d on S, and assume that we have access to one-versus-all distance queries that, given a point s ∈ S, return the distances between s and all other points. Our one-versus-all query represents an efficient sequence database search program such as BLAST, which compares an input sequence to an entire data set. Given a natural assumption about the approximation stability of the min-sum objective function for clustering, we design a provably accurate clustering algorithm that uses few one-versus-all queries. In our empirical study we show that our method compares favorably to well-established clustering algorithms when we compare computationally derived clusterings to gold-standard manual classifications.
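To make the query model concrete, the following minimal Python sketch illustrates a one-versus-all oracle that counts queries, together with a min-sum cost function. It is not the algorithm of the paper: the names `OneVersusAllOracle`, `assign_to_nearest_landmark`, and `min_sum_cost` are hypothetical, the nearest-landmark assignment is a naive stand-in used only to show how few queries can drive a clustering, and the cost is assumed here to sum each unordered intra-cluster pair once (some formulations count each pair twice).

```python
from itertools import combinations
from typing import Callable, List, Sequence


class OneVersusAllOracle:
    """Hypothetical stand-in for a database search such as BLAST.

    The metric is hidden behind `query`, and every call is counted.
    """

    def __init__(self, points: Sequence[float], dist: Callable[[float, float], float]):
        self.points = list(points)
        self._dist = dist
        self.num_queries = 0

    def query(self, i: int) -> List[float]:
        """Return the distances from point i to every point (entry i is 0)."""
        self.num_queries += 1
        s = self.points[i]
        return [self._dist(s, t) for t in self.points]


def assign_to_nearest_landmark(oracle: OneVersusAllOracle,
                               landmark_ids: Sequence[int]) -> List[List[int]]:
    """Naive illustration: one query per landmark, then nearest-landmark assignment."""
    rows = [oracle.query(i) for i in landmark_ids]
    clusters: List[List[int]] = [[] for _ in landmark_ids]
    for j in range(len(oracle.points)):
        nearest = min(range(len(rows)), key=lambda c: rows[c][j])
        clusters[nearest].append(j)
    return clusters


def min_sum_cost(clusters: Sequence[Sequence[int]],
                 points: Sequence[float],
                 dist: Callable[[float, float], float]) -> float:
    """Min-sum cost, assuming each unordered intra-cluster pair is counted once.

    Evaluation only: this uses the full metric, which the clustering step
    above never sees.
    """
    return sum(dist(points[a], points[b])
               for cluster in clusters
               for a, b in combinations(cluster, 2))


# Toy data on the real line; absolute difference plays the role of the metric d.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
dist = lambda x, y: abs(x - y)

oracle = OneVersusAllOracle(points, dist)
clusters = assign_to_nearest_landmark(oracle, landmark_ids=[0, 3])
cost = min_sum_cost(clusters, points, dist)
print(clusters, oracle.num_queries, cost)
```

In this toy run the clustering step issues two one-versus-all queries for six points, whereas computing all pairwise distances up front would require one query per point.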
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Voevodski, K., Balcan, MF., Röglin, H., Teng, SH., Xia, Y. (2011). Min-sum Clustering of Protein Sequences with Limited Distance Information. In: Pelillo, M., Hancock, E.R. (eds) Similarity-Based Pattern Recognition. SIMBAD 2011. Lecture Notes in Computer Science, vol 7005. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24471-1_14
DOI: https://doi.org/10.1007/978-3-642-24471-1_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24470-4
Online ISBN: 978-3-642-24471-1
eBook Packages: Computer Science, Computer Science (R0)