Article

A multi-expert system for the automatic detection of protein domains from sequence information

Authors:

Niranjan Nagarajan,

Golan Yona

RECOMB '03: Proceedings of the seventh annual international conference on Research in computational molecular biology

Pages 224 - 234

https://doi.org/10.1145/640075.640104

Published: 10 April 2003

Abstract

We describe a novel method for detecting the domain structure of a protein from sequence information alone. The method is based on analyzing multiple sequence alignments that are derived from a database search. Multiple measures are defined to quantify the domain information content of each position along the sequence, and these measures are combined into a single predictor using a neural network. The output is further smoothed and post-processed using a probabilistic model to predict the most likely transition or boundary positions between domains. The method was assessed using the domain definitions in SCOP for proteins of known structure and was compared to several other existing methods. Our method improves significantly over the best method available, the semi-manual Pfam domain database, while being fully automatic. Our method can also be used to verify domain partitions based on structural data. A few examples of predicted domain definitions and alternative partitions, as suggested by our method, are also discussed.
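
The pipeline outlined in the abstract (per-position measures, a combined predictor, smoothing, boundary selection) can be illustrated with a minimal Python sketch. This is not the authors' implementation: a single logistic unit stands in for the neural network, a moving average stands in for the smoothing step, and simple peak picking stands in for the probabilistic post-processing model. All function names, weights, and thresholds below are hypothetical and chosen only for illustration.

```python
import math


def combine_measures(measures, weights, bias=0.0):
    """Combine several per-position measures into one score per position.

    Assumption: a single logistic unit replaces the paper's neural network.
    `measures` is a list of equal-length lists, one list per measure.
    """
    length = len(measures[0])
    scores = []
    for i in range(length):
        z = bias + sum(w * m[i] for w, m in zip(weights, measures))
        scores.append(1.0 / (1.0 + math.exp(-z)))
    return scores


def smooth(scores, window=3):
    """Centered moving average; the window is truncated at the sequence ends."""
    half = window // 2
    out = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out


def pick_boundaries(scores, threshold=0.5, min_separation=2):
    """Select candidate boundary positions from a per-position score track.

    Assumption: local maxima above `threshold`, taken greedily by score and
    kept at least `min_separation` positions apart, replace the paper's
    probabilistic model.
    """
    candidates = [
        i for i in range(1, len(scores) - 1)
        if scores[i] >= threshold
        and scores[i] >= scores[i - 1]
        and scores[i] >= scores[i + 1]
    ]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    chosen = []
    for i in candidates:
        if all(abs(i - j) >= min_separation for j in chosen):
            chosen.append(i)
    return sorted(chosen)
```

In use, one would compute each measure along the query sequence, pass the resulting tracks through `combine_measures`, then `smooth`, then `pick_boundaries`. The returned indices are candidate transition positions between domains under the toy assumptions above.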

Cited By

Mohamed, S., Rubin, D. & Marwala, T. (2006). Multi-class Protein Sequence Classification Using Fuzzy ARTMAP. 2006 IEEE International Conference on Systems, Man and Cybernetics, 1676--1681. Online publication date: Oct-2006. https://doi.org/10.1109/ICSMC.2006.384960

Published In

RECOMB '03: Proceedings of the seventh annual international conference on Research in computational molecular biology

April 2003

352 pages

ISBN:1581136358

DOI:10.1145/640075

Editors:
Martin Vingron, Max-Planck-Institute for Molecular Genetics, Germany
Sorin Istrail, Celera Genomics/Applied Biosystems
Pavel Pevzner, University of California at San Diego, CA
Michael Waterman, University of Southern California, CA

Program Chair:
Webb Miller, The Pennsylvania State University

Copyright © 2003 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

RECOMB03

RECOMB03: The Seventh Annual International Conference on Research in Computational Molecular Biology

April 10 - 14, 2003

Berlin, Germany

Acceptance Rates

RECOMB '03 Paper Acceptance Rate 35 of 175 submissions, 20%;

Overall Acceptance Rate 148 of 538 submissions, 28%
