DOI: 10.1145/640075.640104
Article

A multi-expert system for the automatic detection of protein domains from sequence information

Published: 10 April 2003

Abstract

We describe a novel method for detecting the domain structure of a protein from sequence information alone. The method is based on analyzing multiple sequence alignments that are derived from a database search. Multiple measures are defined to quantify the domain information content of each position along the sequence, and these are combined into a single predictor using a neural network. The output is further smoothed and post-processed using a probabilistic model to predict the most likely transition or boundary positions between domains. The method was assessed using the domain definitions in SCOP for proteins of known structure and was compared to several other existing methods. Our method improves significantly over the best method available, the semi-manual Pfam domain database, while being fully automatic. Our method can also be used to verify domain partitions based on structural data. A few examples of predicted domain definitions and alternative partitions, as suggested by our method, are also discussed.
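The pipeline outlined in the abstract (per-position measures from an alignment, a combined score, smoothing, then boundary selection) can be illustrated with a minimal sketch. This is not the authors' implementation, and every name and parameter in it is hypothetical. The two measures (column entropy and gap fraction) are illustrative choices only, a fixed weighted sum stands in for the neural network, a moving average stands in for the smoothing step, and thresholded peak picking stands in for the probabilistic model.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column, ignoring gaps."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def gap_fraction(column):
    """Fraction of aligned sequences that have a gap at this column."""
    return sum(c == "-" for c in column) / len(column)


def position_scores(alignment, w_entropy=0.5, w_gap=0.5):
    """Combine the two illustrative measures into one score per column.

    Assumption: a fixed weighted sum replaces the neural network of the paper.
    """
    max_entropy = math.log2(20)  # upper bound for 20 amino acid types
    scores = []
    for column in zip(*alignment):
        h = column_entropy(column) / max_entropy
        scores.append(w_entropy * h + w_gap * gap_fraction(column))
    return scores


def smooth(scores, window=3):
    """Centered moving average; the window is truncated at the sequence ends."""
    half = window // 2
    out = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out


def candidate_boundaries(scores, threshold):
    """Indices of interior local maxima whose smoothed score reaches the threshold."""
    return [
        i
        for i in range(1, len(scores) - 1)
        if scores[i] >= threshold
        and scores[i] > scores[i - 1]
        and scores[i] >= scores[i + 1]
    ]


# Toy alignment (hypothetical data): two conserved blocks joined by a gappy linker.
alignment = [
    "AAAA--CCCC",
    "AAAA--CCCC",
    "AAAAGGCCCC",
    "AAAA--CCCC",
]
print(candidate_boundaries(smooth(position_scores(alignment)), threshold=0.2))
```

On this toy input the only candidate falls in the gappy linker between the two conserved blocks. In a real setting the weights, window, and threshold would have to be learned or tuned against reference domain definitions such as those in SCOP.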



Published In

RECOMB '03: Proceedings of the seventh annual international conference on Research in computational molecular biology
April 2003
352 pages
ISBN:1581136358
DOI:10.1145/640075

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. SCOP
  2. domain boundaries
  3. domain prediction
  4. protein domains

Qualifiers

  • Article

Conference

RECOMB03

Acceptance Rates

RECOMB '03 Paper Acceptance Rate: 35 of 175 submissions, 20%
Overall Acceptance Rate: 148 of 538 submissions, 28%
