Abstract
Multiple string (sequence) alignment is a difficult and important problem in computational biology, where it is central in two related tasks: finding highly conserved subregions or embedded patterns of a set of biological sequences (strings of DNA, RNA or amino acids), and inferring the evolutionary history of a set of taxa from their associated biological sequences. Several precise measures have been proposed for evaluating the goodness of a multiple alignment, but no efficient methods are known which compute the optimal alignment for any of these measures in any but small cases. In this paper, we consider two previously proposed measures, and given two computationaly efficient multiple alignment methods (one for each measure) whose deviation from the optimal value isguaranteed to be less than a factor of two. This is the novel feature of these methods, but the methods have additional virtues as well. For both methods, the guaranteed bounds are much smaller than two when the number of strings is small (1.33 for three strings of any length); for one of the methods we give a related randomized method which is much faster and which gives, with high probability, multiple alignments with fairly small error bounds; and for the other measure, the method given yields a non-obviouslower bound on the value of the optimal alignment.
Similar content being viewed by others
Literature
Altschul, S. 1989. Gap costs for multiple alignment.J. theor. Biol. 138, 297–309.
Altschul, S. and D. Lipman. 1989. Trees, stars, and multiple sequence alignment.SIAM J. appl. Math 49, 197–209.
Argos, P. and M. Vingren. 1990. Sensitivity comparisons of protein amino acid sequences. InMethods in Enzymology. Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences, R. Doolittle (Ed.), Vol. 183, pp. 352–365. New York: Academic Press.
Bacon, D. and W. Anderson. 1986. Multiple sequence alignment.J. molec. Biol. 191, 153–161.
Carrillo, H. and D. Lipman. 1988. The multiple sequence alignment problem in biology.SIAM J. appl. Math 48, 1073–1082.
Doolittle, R. 1986.Of Urfs and Orfs: A Primer on How to Analyze Derived Amino Acid Sequences. Mill Valley, CA: University Science Books.
Feng, D. and R. Doolittle. 1987. Progressive sequence alignment as a prerequisite to correct phylogenetic trees.J. molec. Evol. 25, 351–360.
Gusfield, D. 1984.The Steiner Tree Problem in Phylogeny. Technical report No. 334. Yale University Computer Science Department, September 1984.
Johnson, M. and R. Doolittle. 1986. A method for the simultaneous alignment of three or more amino acid sequences.J. molec. Evol. 23, 267–278.
Jukes, T. H. and C. R. Cantor. 1969. Evolution of protein molecules. InMammalian Protein Metabolism, H. N. Munro (Ed.), pp. 21–132. New York: Academic Press.
Kou, L., G. Markowsky and L. Berman. 1981. A fast algorithm for steiner trees.ACTA Informatica 15.
Lipman, D., S. Altshul and J. Kececioglu. 1980. A tool for multiple sequence alignment.Proc. natn. Acad. Sci., U.S.A. 86, 4412–4415.
Murata, M., J. Richardson and Joel Sussman. 1985. Simultaneous comparison of three protein sequences.Proc. natn. Acad. Sci., U.S.A.,82, 3073–3077.
Sankoff, D. and R. Cedergren. 1983. Simultaneous comparisons of three or more sequences related by a tree. InTime Warps, String Edits, and Macromolecules: the Theory and Practice of Sequence Comparison, D. Sankoff and J. Kruskal (Eds), pp. 253–264. Reading, MA: Addison Wesley.
Sankoff, D. and J. Kruskal (Eds). 1983.Time Warps, String Edits, and Macromolecules: the Theory and Pactice of Sequence Comparison. Reading, MA: Addison Wesley.
Schuler, G. D., S. F. Altschul and D. J. Lipman. 1992. A workbench for multiple alignment construction and analysis. InProteins: Structure, Function and Genetics, in press.
Schwarz, R. and M. Dayhoff. 1979. Matrices for detecting distant relationships. InAtlas of Protein Sequences, M. Dayhoff (Ed.), pp. 353–358. National Biomedical Research Foundation.
Tarjan, R. E. 1983.Data Structures and Network Algorithms. Philadelphia, PA: SIAM.
Waterman, M. 1986. Multiple sequence alignment by consensus.Nucleic Acids Res. 14, 9095–9102.
Wong, R. 1980. Worst-case analysis of network design problem heuristics.SIAM J. Algebraic Discrete Methods 1, 51–63.
Author information
Authors and Affiliations
Rights and permissions
About this article
Cite this article
Gusfield, D. Efficient methods for multiple sequence alignment with guaranteed error bounds. Bltn Mathcal Biology 55, 141–154 (1993). https://doi.org/10.1007/BF02460299
Received:
Revised:
Issue Date:
DOI: https://doi.org/10.1007/BF02460299