BCH 516-1

GENOMICS AND COMPUTATIONAL BIOLOGY
BCH 516
BUSARI M.B
FEDERAL UNIVERSITY OF TECHNOLOGY
MINNA
busari.bola@futminna.edu.ng
https://scholar.google.com/citations?user=dxLL0ZoAAAAJ&hl=en
Course Outline
 The concept of genes
 Molecular biology/computational research
 Biological Data and their sources; Cellular,
molecular biology, Biochemistry, evolutionary
biology, DNA and protein sequence data.
 Sequence alignment
 Global and local alignment
 Multiple sequence alignment
 Phylogenetic analysis
 Applications of bioinformatics and computational
biology
• Molecular Biology
 Field of biology that studies the composition, structure and interactions of cellular
molecules, such as nucleic acids and proteins, that carry out the biological
processes essential for the cell's functions and maintenance.
• Gene
 Genes are segments of DNA that contain instructions for building the molecules
that make the body function.
• Genome
 All the genetic material in an organism. It is made of DNA (or RNA in some
viruses) and includes genes and other elements that control the activity of those
genes.
• Genomics
 The branch of molecular biology concerned with the structure, function, evolution,
and mapping of genomes.
• Bioinformatics
 Collection and storage of biological information
 Derives knowledge from computer analysis of biological data
• Computational biology
 Development of algorithms and statistical models to analyze biological data
Data Types
 According to the types of data they manage,
biological databases fall roughly into
the following categories:
(1) DNA, (2) RNA, (3) protein, (4) expression, (5)
pathway, (6) disease, (7) nomenclature, (8)
literature, and (9) standard and ontology.
 Sources of the data include:
• Cellular and molecular biology
• Genetics
• Biochemistry
• Evolutionary Biology
DNA SEQUENCING
• The 4 steps of next generation sequencing
(NGS) are nucleic acid isolation, library
preparation, clonal amplification and
sequencing, and data analysis.
• Step 1: Nucleic Acid Extraction and
Isolation. ...
• Step 2: Library Preparation. ...
• Step 3: Clonal Amplification and
Sequencing. ...
• Step 4: Data Analysis Using Bioinformatics.
DNA DATABASES
 A DNA database manages DNA data from many
species or from a specific species. The primary
functions of human DNA databases include:
• Establishing a reference genome (e.g., NCBI RefSeq)
• Profiling human genetic variation (e.g., dbSNP)
• Associating genotype with phenotype (e.g., EGA)
• Identifying human microbiome metagenomes
(e.g., IMG/HMP)
 A representative example of a DNA database is
GenBank, a collection of all publicly available DNA
sequences (http://www.ncbi.nlm.nih.gov/genbank).
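As a minimal sketch of how a GenBank record might be requested from a program, the Python snippet below builds a request URL for the NCBI E-utilities efetch service. It only constructs the URL and does not contact the network. The helper name genbank_fasta_url and the accession U49845 are illustrative assumptions, not part of the course material.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_fasta_url(accession):
    """Build an efetch URL asking for one nucleotide record in FASTA format.

    Hypothetical helper for illustration; 'accession' is any nucleotide
    accession number supplied by the caller.
    """
    params = {
        "db": "nuccore",      # nucleotide collection, which includes GenBank
        "id": accession,      # record to retrieve
        "rettype": "fasta",   # return the sequence as FASTA
        "retmode": "text",    # plain text rather than XML
    }
    return EFETCH_BASE + "?" + urlencode(params)


# Illustrative accession only; replace with the record of interest.
url = genbank_fasta_url("U49845")
print(url)
```

Opening the printed URL in a browser, or passing it to a download tool, would be one way to retrieve the record.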
RNA DATABASES
Only a tiny proportion of the human genome is
transcribed into mRNAs, whereas the vast
majority of the genome is transcribed into
"dark matter": non-coding RNAs (ncRNAs)
that do not encode proteins, including
microRNAs (miRNAs), small nucleolar RNAs
(snoRNAs), PIWI-interacting RNAs (piRNAs), and long
non-coding RNAs (lncRNAs).
A representative example of an RNA database is
RNAcentral (http://rnacentral.org).
Protein databases
 The purposes of constructing protein databases include:
• Collection of universal proteins (e.g., UniProt)
• Identification of protein families and domains (e.g., Pfam)
• Reconstruction of phylogenetic trees (e.g., TreeFam [24])
• Profiling of protein structures (e.g., PDB)
 A representative example of a protein database is PDB, the
main primary database for 3D structures of biological
macromolecules determined by X-ray crystallography and
NMR.
 Established in 1971, PDB contained 105,465
biological macromolecular structures as of 30 December
2014, of which 27,393 entries are from human
(http://www.rcsb.org/pdb).
Expression databases
 Expression databases can be used for various
purposes:
• Archiving expression data (e.g., GEO)
• Detecting differential and baseline expression (e.g.,
Expression Atlas)
• Exploring tissue-specific gene expression and
regulation (e.g., TiGER)
• Profiling expression information based on both
RNA and protein data (e.g., Human Protein Atlas).
 A representative example of an expression database is
the Human Protein Atlas (http://www.proteinatlas.org).
Pathway databases
 Pathway databases contain biological pathways for
metabolic, signaling, and regulatory pathway
analysis.
 A representative example is KEGG PATHWAY, a
curated biological pathway resource on
molecular interaction and reaction networks.
 As the core of KEGG, KEGG PATHWAY
integrates many entities that are stored in KEGG
sibling databases, including genes, proteins, RNAs,
chemical compounds, and chemical reactions
(http://www.genome.jp/kegg/pathway.html).
Disease databases
 There are at least 200 forms of cancer in the world, causing 14.6% of
all human deaths.
 Thus, obtaining complete cancer genomes and identifying molecular
mutations and abnormal genes can provide new insights for cancer
prevention, detection, and eventually, personalized treatment.
 Toward this end, there are two well-known cancer projects, viz., The
Cancer Genome Atlas (TCGA) and International Cancer Genome
Consortium (ICGC).
 TCGA, founded in 2006 by the National Cancer Institute and National
Human Genome Research Institute at the National Institutes of
Health, aims to collect a wide diversity of omics data for more than
20 different types of human cancer (http://cancergenome.nih.gov).
Unlike TCGA, ICGC is a voluntary collaborative organization
initiated in 2008 and open to all cancer and genomic researchers in the
world. It aims to obtain a comprehensive description of genomic,
Nomenclature Databases
 A nomenclature database provides data for all
human genes that have approved symbols.
 Genew is a database of human gene
symbols, managed by the HUGO Gene
Nomenclature Committee (HGNC) as a
confidential database containing over 16 000
records.
 Its data are integrated with other human gene
databases, e.g. GDB, LocusLink and SWISS-PROT.
 The Mouse Genome Database (MGD) is the database
for approved mouse gene symbols.
Gene Ontology Databases
 The Gene Ontology (GO) is a major bioinformatics
initiative to unify the representation of gene and gene
product attributes across all species.
 GO aims to:
• maintain and develop its controlled
vocabulary of gene and gene product attributes;
• annotate genes and gene products;
• assimilate and disseminate annotation data.
 GO is part of a larger classification effort, the Open Biomedical Ontologies (OBO).
 Gene nomenclature focuses on naming genes and gene products,
whereas GO focuses on the function of the genes and gene
products.
Summary
Why is bioinformatics critical?
 Few people are adequately trained in both biology and computer
science
 Genome sequencing, microarrays etc. lead to large amounts
of data to be analyzed
 Leads to important discoveries
 Saves time and money

Why is the relationship between Computer
Science and Biology essential?
Three main reasons:
First, massive amounts of data have to be stored, analyzed and
made accessible.
Second, the nature of the data is often such that a computational,
statistical method is necessary. This applies in particular to the
information on the building plans of proteins and the spatial
organization of their expression in the cell, encoded by the DNA.
Third, there is a strong analogy between the DNA sequence and
a computer program.
Key Areas/Scope of Bioinformatics
1. Organizing biological knowledge in databases
2. Analysing sequence data
3. Structural Bioinformatics
4. Pharmacological relevance (Population genetics)

1. Organizing biological knowledge in databases
 Organized DNA sequences: GenBank (NCBI), EMBL
 Protein sequence databanks, with structural and functional
characteristics. For example, SWISSPROT contains verified
protein sequences and more annotations describing the function
of a protein
 Literature databases: PUBMED, MEDLINE

2. Analysing sequence data
 Establish the correct order of sequence contigs
 Find the translation and transcription initiation sites, find promoter sites,
define open reading frames (ORF)
 Find splice sites, introns, exons
 Translate the DNA sequence into a protein sequence
 Compare the DNA sequence to known protein sequences in order to
verify exons etc. with homologous sequences.
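Two of the tasks above, translating DNA into protein and defining open reading frames, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: it uses the standard genetic code, scans only the forward strand, and treats an ORF as an ATG followed in frame by a stop codon. The function names translate and find_orfs and the parameter min_codons are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string codon by codon; '*' marks a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Codons with ambiguous bases (e.g. N) are reported as 'X'.
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


def find_orfs(dna, min_codons=2):
    """Return (start, end, protein) for forward-strand ORFs.

    start/end are 0-based, end-exclusive and include the stop codon.
    min_codons counts coding codons, including the start codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == "ATG":
                # Walk downstream in the same frame until a stop codon.
                for j in range(i, len(dna) - 2, 3):
                    if CODON_TABLE.get(dna[j:j + 3]) == "*":
                        if (j - i) // 3 >= min_codons:
                            orfs.append((i, j + 3, translate(dna[i:j])))
                        i = j  # resume scanning after this stop codon
                        break
            i += 3
    return orfs


example = "CCATGGCTTGATAA"
print(translate("ATGGCTTGA"))
print(find_orfs(example))
```

Real gene finders also consider the reverse strand, splice sites and promoter signals, which this sketch deliberately leaves out.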
Multiple sequence alignments
 Studying evolutionary aspects, by the construction of phylogenetic trees
 Determining active site residues, and residues specific for subfamilies
 Predicting protein–protein interactions
 Analysing single nucleotide polymorphisms to hunt for genetic sources of
diseases.
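To make the uses of a multiple sequence alignment listed above more concrete, the Python sketch below takes sequences that are assumed to be already aligned (equal length, '-' for gaps) and computes two simple quantities: the proportion of differing sites between two sequences, which distance-based tree methods can take as input, and the alignment columns that vary, which are candidate polymorphic sites. The toy alignment and the names p_distance and variable_sites are hypothetical.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable (ungapped) columns")
    return sum(x != y for x, y in pairs) / len(pairs)


def variable_sites(alignment):
    """Return 0-based column indices with more than one non-gap residue."""
    columns = zip(*alignment.values())
    return [i for i, col in enumerate(columns) if len(set(col) - {"-"}) > 1]


# Toy alignment, invented for illustration only.
alignment = {
    "seq1": "ACGTACGT",
    "seq2": "ACGTACGA",
    "seq3": "ACCTAC-A",
}

print(p_distance(alignment["seq1"], alignment["seq2"]))
print(variable_sites(alignment))
```

A full pairwise distance matrix built this way could be handed to a tree-building method such as neighbour joining.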
3. Structural Bioinformatics
 This branch of bioinformatics is concerned with
computational approaches
to predict and analyse the spatial structure of proteins and
nucleic acids.
 Using multiple sequence alignment, secondary structure and 3D
structure can be predicted with an accuracy above 70 %.
4. Pharmacological relevance
 Drug targets in infectious organisms can be revealed by whole
genome comparisons of infectious and non-infectious organisms.
 The analysis of single nucleotide polymorphisms reveals genes
potentially responsible for genetic diseases.
 Prediction and analysis of protein 3D structure is used to
develop drugs and understand drug resistance.
 Patient databases with genetic profiles, e.g. for cardiovascular
diseases, diabetes, cancer, etc. may play an important role in the
future for individual health care, by integrating the personal genetic
profile (population genetics) into diagnosis.
Genomic Browsers
 National Center for Biotechnology Information (NCBI)
(http://ncbi.nlm.nih.gov)
 Ensembl Genome Browser (http://www.ensembl.org)
 UCSC Genome Browser (http://genome.ucsc.edu/)
 WormBase (http://www.wormbase.org/)
 AceDB (http://www.acedb.org/)
 FlyBase (http://flybase.bio.indiana.edu/)
Protein databases
• SWISS-PROT/TrEMBL: curated protein sequences
http://www.expasy.ch/sprot
• InterPro: Protein families and domains
http://www.ebi.ac.uk/interpro
• EXProt: proteins with experimentally verified functions
http://www.cmbi.nl/exprot
• Protein Information Resource (PIR)
http://pir.georgetown.edu/
NCBI
NCBI text search of a protein
Abstract finding by NCBI
Nucleotide search of a typical gene
FASTA format
FASTA format is a text-based format for representing either
nucleic acid sequences or protein
sequences, in which nucleotides or
amino acid residues are represented using
single-letter codes.
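The format can be illustrated with a short Python sketch that reads FASTA text into a dictionary. It assumes the usual layout, in which each record begins with a header line starting with ">" followed by one or more sequence lines. The function name parse_fasta and the two example records are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into {header: sequence}.

    A header line starts with '>'; the following lines, up to the next
    header, are joined into a single sequence string.
    """
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Invented records for illustration: one nucleotide, one protein.
example_fasta = """>seq1 example nucleotide sequence
ATGGCTAGCT
AGGCTA
>seq2 example protein sequence
MKTAYIAKQR
"""

records = parse_fasta(example_fasta)
for name, sequence in records.items():
    print(name, len(sequence))
```

Keeping the whole header as the dictionary key is a simplification; many tools split it into an identifier and a free-text description.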
