Bioinformatics Definition
Bioinformatics Definition
Bioinformatics Definition
Michael Nilges and Jens P. Linge, Unit de BioInformatique Structurale, Institut Pasteur,
2528 rue du Docteur Roux, F75015 Paris, France
Synonyms
Related: Computational Biology, Computational Molecular Biology, Biocomputing
Definition
Bioinformatics derives knowledge from computer analysis of biological data. These can
consist of the information stored in the genetic code, but also experimental results from
various sources, patient statistics, and scientific literature. Research in bioinformatics includes
method development for storage, retrieval, and analysis of the data. Bioinformatics is a
rapidly developing branch of biology and is highly interdisciplinary, using techniques and
concepts from informatics, statistics, mathematics, chemistry, biochemistry, physics, and
linguistics. It has many practical applications in different areas of biology and medicine.
Description
The history of computing in biology goes back to the 1920s when scientists were already
thinking of establishing biological laws solely from data analysis by induction (e.g. A.J.
Lotka, Elements of Physical Biology, 1925). However, only the development of powerful
computers, and the availability of experimental data that can be readily treated by
computation (for example, DNA or amino acid sequences and threedimensional structures of
proteins) launched bioinformatics as an independent field. Today, practical applications of
bioinformatics are readily available through the world wide web, and are widely used in
biological and medical research. As the field is rapidly evolving, the very definition of
bioinformatics is still the matter of some debate.
The relationship between computer science and biology is a natural one for several reasons.
First, the phenomenal rate of biological data being produced provides challenges: massive
amounts of data have to be stored, analysed, and made accessible. Second, the nature of the
data is often such that a statistical method, and hence computation, is necessary. This applies
in particular to the information on the building plans of proteins and of the temporal and
spatial organisation of their expression in the cell encoded by the DNA. Third, there is a
strong analogy between the DNA sequence and a computer program (it can be shown that the
DNA represents a Turing Machine).
Analyses in bioinformatics focus on three types of datasets: genome sequences,
macromolecular structures, and functional genomics experiments (e.g. expression data, yeast
twohybrid screens). But bioinformatic analysis is also applied to various other data, e.g.
taxonomy trees, relationship data from metabolic pathways, the text of scientific papers, and
patient statistics. A large range of techniques are used, including primary sequence alignment,
protein 3D structure alignment, phylogenetic tree construction, prediction and classification
of protein structure, prediction of RNA structure, prediction of protein function, and
expression data clustering. Algorithmic development is an important part of bioinformatics,
and techniques and algorithms were specifically developed for the analysis of biological data
(e.g., the dynamic programming algorithm for sequence alignment).
Bioinformatics has a large impact on biological research. Giant research projects such as the
human genome project [4] would be meaningless without the bioinformatics component. The
goal of sequencing projects, for example, is not to corroborate or refute a hypothesis, but to
provide raw data for later analysis. Once the raw data are available, hypotheses may be
formulated and tested in silico. In this manner, computer experiments may answer biological
questions which cannot be tackled by traditional approaches. This has led to the founding of
dedicated bioinformatics research groups as well as to a different work practice in the average
bioscience laboratory where the computer has become an essential research tool.
Three key areas are the organisation of knowledge in databases, sequence analysis, and
structural bioinformatics.
establish the correct order of sequence contigs to obtain one continuous sequence;
find the tranlation and transcription initiation sites, find promoter sites, define open
reading frames (ORF);
translate the DNA sequence into a protein sequence, searching all six frames;
compare the DNA sequence to known protein sequences in order to verify exons etc
with homologuous sequences.
Some completely automated annotation systems have been developed (e.g., GENEQUIZ),
which use a multitude of different programs and methods.
The protein sequences are further analysed to predict function. The function can often be
inferred if a sequence of a homologous protein with known function can be found. Homology
searches are the predominant bioinformatics application, and very efficient search methods
have been developed [3]. The often difficult distinction between orthologous sequences and
paralogous sequences facilitates the functional annotation in the comparison of whole
genomes. Several methods detect glycolysation, myristylation and other sites, and the
prediction of signal peptides in the amino acid sequence give valuable information about the
subcellular location of a protein.
The ultimate goal of sequence annotation is to arrive at a complete functional description of
all genes of an organism. However, function is an illdefined concept. Thus, the simplified
idea of one gene one protein one structure one function cannot take into account
proteins that have multiple functions depending on context (e.g., subcellar location and the
presence of cofactors). Well-known cases of moonlighting proteins are lens crystalline and
phosphoglucose isomerase. Currently, work on ontologies is under way to explicitly define a
vocabulary that can be applied to all organisms even as knowledge of gene and protein roles
in cells is accumulating and changing.
Families of similar sequences contain information on sequence evolution in the form of
specific conservation patters at all sequence positions. Multiple sequence alignments are
useful for
Many complete genomes of microorganisms and a few of eukaryotes are available [4]. By
analysis of entire genome sequences a wealth of additional information can be obtained. The
complete genomic sequence contains not only all protein sequences but also sequences
regulating gene expression. A comparison of the genomes of genetically close organisms
reveals genes responsible for specific properties of the organisms (e.g., infectivity). Protein
interactions can be predicted from conservation of gene order or operon organisation in
different genomes. Also the detection of gene fusion and gene fission (i.e, one protein is split
into two in another genome) events helps to deduce protein interactions.
Structural bioinformatics
This branch of bioinformatics is concerned with computational approaches to predict and
analyse the spatial structure of proteins and nucleic acids. Whereas in many cases the primary
sequence uniquely specifies the threedimensional (3D) structure, the specific rules are not
well understood, and the protein folding problem remains largely unsolved. Some aspects of
protein structure can already be predicted from amino acid content. Secondary structure can
be deduced from the primary sequence with statistics or neural networks. When using a
multiple sequence alignment, secondary structure can be predicted with an accuracy above 70
%.
3D models can be obtained most easily if the 3D structure of a homologous protein is known
(homology modelling, comparative modelling). A homology model can only be as good as the
sequence alignment: whereas protein relationships can be detected at the 20% identity level
and below, a correct sequence alignment becomes very difficult, and the homology model
will be doubtful. From 40 to 50% identity the models are usually mostly correct; however, it
is possible to have 50% identity between two carefully designed protein sequences with
different topology (the socalled JANUS protein). Remote relationships that are undetectable
by sequence comparisons may be detected by sequencetostructurefitness (or threading)
approaches: the search sequence is systematically compared to all known protein structures.
Ab initio predictions of protein 3D structure remains the major challenge; some progress has
been made recently by combining statistical with forcefield based approaches.
Membrane proteins are interesting drug targets. It is estimated that membrane receptors form
50 % of all drug targets in pharmacological research. However, membrane proteins are
underrepresented in the PDB structure database. Since membrane proteins are usually
excluded from structural genomics initiatives due to technical problems, the prediction of
transmembrane helices and solvent accessibility is very important. Modern methods can
predict transmembrane helices with a reliability greater than 70 %.
Understanding the 3D structure of a macromolecule is crucial for understanding its function.
Many properties of the 3D structure cannot be deduced directly from the primary sequence.
Obtaining better understanding of protein function is the driving force behind structural
genomics efforts, which can be thus understood as part of functional genomics. Similar
structure can imply similar function. General structuretofunction relationships can be
obtained by statistical approaches, for example, by relating secondary structure to known
protein function or surface properties to cell location.
The increased speed of structure determination necessary for the structural genomics projects
make an independent validation of the structures (by comparison to expected properties)
particularly important. Structure validation helps to correct obvious errors (e.g., in the
covalent structure) and leads to a more standardized representation of structural data, e.g., by
agreeing on a common atom name nomenclature. The knowledge of the structure quality is a
prerequisite for further use of the structure, e.g in molecular modelling or drug design.
In order to make as much data on the structure and its determination available in the
databases, approaches for automated data harvesting are being developed. Structure
classification schemes, as implemented for example in the SCOP, CATH, and FSSP
databases, elucidate the relationship between protein folds and function and shed light on the
evolution of protein domains.
Combined analysis of structural and genomic data will certainly get more important in the
near future. Protein folds can be analysed for whole genomes. Proteinprotein interactions
predicted on the sequence level, can be studied in more detail on the structure level. Single
Nucleotide Polymorphisms can be mapped on 3D structures of proteins in order to elucidate
specific structural causes of disease.
More detailed aspects of protein function can be obtained also by forcefield based
approaches. Whereas protein function requires protein dynamics, no experimental technique
can observe it directly on an atomic scale, and motions have to be simulated by molecular
dynamics (MD) simulations. Also free energy differences (for example between binding
energies of different protein ligands) can be characterized by MD simulations. Molecular
mechanics or molecular dynamics based approaches are also necessary for homology
modelling and for structure refinement in Xray crystallography and NMR structure
determination.
Drug design exploits the knowledge of the 3D structure of the binding site (or the structure of
the complex with a ligand) to construct potential drugs, for example inhibitors of viral
proteins or RNA. In addition to the 3D structure, a force field is necessary to evaluate the
interaction between the protein and a ligand (to predict binding energies). In virtual screening,
a library of molecules is tested on the computer for their capacities to bind to the
macromolecule.
Pharmacological Relevance
Many aspects of bioinformatics are relevant for pharmacology. Drug targets in infectious
organisms can be revealed by whole genome comparisons of infectious and noninfectious
organisms. The analysis of single nucleotide polymorphisms reveals genes potentially
responsible for genetic deseases. Prediction and analysis of protein 3D structure is used to
develop drugs and understand drug resistance.
Patient databases with genetic profiles, e.g. for cardiovascular diseases, diabetes, cancer, etc.
may play an important role in the future for individual health care, by integrating personal
genetic profile into diagnosis, despite obvious ethical problems. The goal is to analyse a
patients individual genetic profile and compare it with a collection of reference profiles and
other related information. This may improve individual diagnosis, prophylaxis, and therapy.
References
1. Bairoch A, Apweiler R (2000) The SWISSPROT protein sequence database and its
supplement TrEMBL in 2000. Nucleic Acids Res. 28:4548
2. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN,
Bourne PE (2000) The Protein Data Bank. Nucleic Acids Res. 28:23542
3. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997)
Gapped BLAST and PSIBLAST: a new generation of protein database search programs.
Nucleic Acids Res. 25:33893402
Pearson WR (2000) Flexible sequence similarity searching with the FASTA3 program
package. Methods Mol. Biol. 132:185219
4. The Genome International Sequencing Consortium (2001) Initial sequencing and analysis
of the human genome. Nature 409:860921
JC Venter et al. (2001) The sequence of the human genome. Science 291:13041351
R.D. Fleischmann et al. (1995) Wholegenome random sequencing and assembly of
haemophilusinfluenzae. Science 269:496512
Glossary
Keyword: Analogous Proteins
Definition:
Two proteins with related folds but unrelated sequences are called analogous. During
evolution, analogous proteins independently developed the same fold.
Keyword: Databank
Definition:
In the biosciences, a databank (or data bank) is a structured set of raw data, most notably
DNA sequences from sequencing projects (e.g. the EMBL and GenBank databases).
Keyword: Database
Definition:
A database (or data base) is a collection of data that is organized so that its contents can easily
be accessed, managed, and modified by a computer. The most prevalent type of database is
the relational database which organizes the data in tables; multiple relations can be
mathematically defined between the rows and columns of each table to yield the desired
information. An object-oriented database stores data in the form of objects which are
organized in hierachical classes that may inherit properties from classes higher in the tree
structure.
In the biosciences, a database is a curated repository of raw data containing annotations,
further analysis, links to other databases. Examples of databases are the SWISSPROT
database for annotated protein sequences or the FlyBase database of genetic and molecular
data for Drosophila melanogaster.
Keyword: Force-field
Definition:
In molecular dynamics and molecular mechanics calculations, the intra- and intermolecular
interactions of a molecule are calculated from a simplified empirical parametrization called a
force field. These include atom masses, charges, dihedral angles, improper angles, van-derWaals and electrostatic interactions, etc.
Keyword: Genome
Definition:
The genome is the gene complement of an organism. A genome sequence comprises the
information of the entire genetic material of an organism.
Keyword: Ontology
Definition:
The word ontology has a long history in philosophy, in which it refers to the study of being as
such. In information science, an ontology is an explicit formal specification of how to
represent the objects, concepts and other entities that are assumed to exist in some area of
interest and the relationships among them.
Keyword: Open Reading Frame (ORF)
Definition:
An opening frame contains a series of codons (base triplets) coding for amino acids without
any termination codons. There are six potential reading frames of an unidentified sequence.
Keyword: Proteome
Definition:
The Proteome is the protein complement expressed by a genome. While the genome is static,
the proteome continually changes in response to external and internal events.
Keyword: Proteomics
Definition:
Proteomics aims at quantifying the expression levels of the complete protein complement (the
proteome) in a cell at any given time. While proteomics research was initially focussed on
two-dimensional gel electrophoresis for protein separation and identification, proteomics now
refers to any procedure that characterizes the function of large sets of proteins. It is thus often
used as a synonym for functional genomics.
Keyword: Threading
Definition:
Threading techniques try to match a target sequence on a library of known three-dimensional
structures by threading the target sequence over the known coordinates. In this manner,
threading tries to predict the three-dimensional structure starting from a given protein
sequence. It is sometimes successful when comparisons based on sequences or sequence
profiles alone fail due to a too low sequence similarity.