
Coursera BioinfoMethods-I Lecture01


Bioinformatic Methods I

Welcome to Bioinformatic Methods I!


Instructor: Nicholas Provart
Nicholas Provart is an associate professor in the Department of Cell & Systems
Biology at the University of Toronto. Since 2009, he has taught the course on which
this Coursera course is based to approximately 700 undergraduate University of
Toronto biology students. His involvement with bioinformatics goes back to 1998.
He was Director of the Collaborative Graduate Program in Genome Biology &
Bioinformatics from 2006-2011, and is one of the founding members of the
International Arabidopsis Informatics Consortium.
Please use the Coursera tools to discuss lecture content and labs.

Course material developed by Ryan Austin, David Guttman, Laura Hug, Momoko Price, and Nicholas Provart
Course produced by Jamie Waese, Rohan Patel, William Heikoop, and Nicholas Provart

Bioinformatic Methods I

N. Provart Intro for Lab 1 Slide 1

Course format and syllabus


This Coursera course will cover the basics of searching one of the main
repositories of sequence information, NCBI's GenBank, using GQuery/Entrez and
Blast, along with creating sequence alignments and phylogenies. Selection
analysis will also be covered, as will next-generation sequence analysis and
metagenomics. Most tools used for exploration are web-based.
Week  Topic
1     NCBI/Blast I
2     Blast II/Comparative Genomics
3     Multiple Sequence Alignments
4     Phylogenetics
5     Selection Analysis
6     NGS Analysis / Metagenomics


What is bioinformatics?
Bioinformatics
- is the development and application of computational tools in managing all
  kinds of biological data
- involves the technology that uses computers for storage, retrieval,
  manipulation, and distribution of information related to biological
  macromolecules such as DNA, RNA, and proteins, and to metabolites
- is generally limited to sequence, structural, and functional analysis of genes
  and genomes and their corresponding products
- is sometimes called computational molecular biology

This field has developed over the past decade or so to help manage the
huge increase in data generated by genome sequencing projects,
high-throughput technologies, etc.

Xiong (2006) Essential Bioinformatics, Cambridge University Press.


Why bioinformatics?
>gi|27500381:c623297-542205 Homo sapiens chromosome 17 genomic contig
AAAACTGCGACTGCGCGGCGTGAGCTCGCTGAGACTTCCTGGACGGGGGACAGGCTGTGGGGTTTCTCAG
ATAACTGGGCCCCTGCGCTCAGGAGGCCTTCACCCTCTGCTCTGGGTAAAGGTAGTAGAGTCCCGGGAAA
GGGACAGGGGGCCCAAGTGATGCTCTGGGGTACTGGCGTGGGAGAGTGGATTTCCGAAGCTGACAGATGG
GTATTCTTTGACGGGGGGTAGGGGCGGAACCTGAGAGGCGTAAGGCGTTGTGAACCCTGGGGAGGGGGGC
AGTTTGTAGGTCGCGAGGGAAGCGCTGAGGATCAGGAAGGGGGCACTGAGTGTCCGTGGGGGAATCCTCG
TGATAGGAACTGGAATATGCCTTGAGGGGGACACTATGTCTTTAAAAACGTCGGCTGGTCATGAGGTCAG
GAGTTCCAGACCAGCCTGACCAACGTGGTGAAACTCCGTCTCTACTAAAAATACAAAAATTAGCCGGGCG
TGGTGCCGCTCCAGCTACTCAGGAGGCTGAGGCAGGAGAATCGCTAGAACCCGGGAGGCGGAGGTTGCAG
TGAGCCGAGATCGCGCCATTGCACTCCAGCCTGGGCGACAGAGCGAGACTGTCTCAAAACAAAACAAAAC
AAAACAAAACAAAAAACACCGGCTGGTATGTATGAGAGGATGGGACCTTGTGGAAGAAGAGGTGCCAGGA
ATATGTCTGGGAAGGGGAGGAGACAGGATTTTGTGGGAGGGAGAACTTAAGAACTGGATCCATTTGCGCC
ATTGAGAAAGCGCAAGAGGGAAGTAGAGGAGCGTCAGTAGTAACAGATGCTGCCGGCAGGGATGTGCTTG
AGGAGGATCCAGAGATGAGAGCAGGTCACTGGGAAAGGTTAGGGGCGGGGAGGCCTTGATTGGTGTTGGT
TTGGTCGTTGTTGATTTTGGTTTTATGCAAGAAAAAGAAAACAACCAGAAACATTGGAGAAAGCTAAGGC
TACCACCACCTACCCGGTCAGTCACTCCTCTGTAGCTTTCTCTTTCTTGGAGAAAGGAAAAGACCCAAGG
GGTTGGCAGCAATATGTGAAAAAATTCAGAATTTATGTTGTCTAATTACAAAAAGCAACTTCTAGAATCT
TTAAAAATAAAGGACGTTGTCATTAGTTCTTTGGTTTGTATTATTCTAAAACCTTCCAAATCTTAAATTT
ACTTTATTTTAAAATGATAAAATGAAGTTGTCATTTTATAAACCTTTTAAAAAGATATATATATATGTTT
TTCTAATGTGTTAAAGTTCATTGGAACAGAAAGAAATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACA
AAATGTCATTAATGCTATGCAGAAAATCTTAGAGTGTCCCATCTGGTAAGTCAGCACAAGAGTGTATTAA
TTTGGGATTCCTATGATTATCTCCTATGCAAATGAACAGAATTGACCTTACATACTAGGGAAGAAAAGAC
ATGTCTAGTAAGATTAGGCTATTGTAATTGCTGATTTCCTTAACTGAAGAACTTTAAAAATATAGAAAAT
GATTCCTTGTTCTCCATCCACTCTGCCTCTCCCACTCCTCTCCTTTTCAACACAAATCCTGTGGTCCGGG
AAAGACAGGGACTCTGTCTTGATTGGTTCTGCACTGGGGCAGGAATCTAGTTTAGATTAACTGGCATTTT
GGCTTTTCTTCCAGCTCTAAAACAAGCTCCATCACTTGAAATGGCAAAATAAAATCATGGATGAGGCCGA
GGGCGGTGGCTTATGCCTGTAATCCCAGCACTTTGGGAGGCCAAGGTGGTAGGATCACGAGGTCAGGAGA
TCGAGACCATCCTGGCCAACATGGTGAAACCCCCTCTCCACTAAAAATACAAAAATTAGCTGGGCGTAGT


Biological Databases
Outline

Why databases?
What is a database?
Data structures: Flat File and Relational
Accession numbers and identifiers
A practical example of utility: GQuery/Entrez



Why databases?
Genome and genomic sequences
Gene sequences, mutations
Gene regulation
Gene expression (where and when)
Intron splice variants
Protein sequence, post-translational modifications
Protein tertiary structure (3D)
Protein networks
Protein localization
Enzyme Kinetics
Metabolites, metabolic networks
Diseases
Literature

To archive accumulated knowledge and to provide scientists with easy access to
biological data


What is a database?
How can data be stored...

Flat-file format, with fields separated by some delimiter:

Nancy|Dengler|Botany|University of Toronto|25 Willocks St, Toronto, ON. M5S 3B2
Peter|Lewis|Dept. of Biochemistry|Uni. Toronto|1 Kings College Circle, Toronto, ON. M5S 1A8
John|Coleman|Department of Botany|University of Toronto|25 Willcocks St, Toronto, ON. M5S 3B2
John|Coleman|Dept. of Biology|York University|4700 Keele St, Toronto, ON. M3J 1P3
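
As a rough illustration of how such a delimited flat file might be read by a program, here is a minimal Python sketch. The field names and the `parse_flat_file` helper are assumptions made for this example (they are not part of the lecture); the records are the four lines shown above.

```python
import csv
import io

# The four records from the flat-file example above.
FLAT_FILE = """\
Nancy|Dengler|Botany|University of Toronto|25 Willocks St, Toronto, ON. M5S 3B2
Peter|Lewis|Dept. of Biochemistry|Uni. Toronto|1 Kings College Circle, Toronto, ON. M5S 1A8
John|Coleman|Department of Botany|University of Toronto|25 Willcocks St, Toronto, ON. M5S 3B2
John|Coleman|Dept. of Biology|York University|4700 Keele St, Toronto, ON. M3J 1P3
"""

# Hypothetical field names, in the order the fields appear in each record.
FIELDS = ["first_name", "last_name", "department", "institution", "address"]


def parse_flat_file(text, delimiter="|"):
    """Split each non-empty line on the delimiter and label the fields."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [dict(zip(FIELDS, row)) for row in reader if row]


records = parse_flat_file(FLAT_FILE)

# Listing the distinct institution strings: the same university
# shows up under two different spellings.
institutions = sorted({record["institution"] for record in records})
print(institutions)
```

Nothing in the file format itself stops the same institution or department from being typed in several different ways, which makes searching and counting unreliable.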

These data could also be stored in a spreadsheet


First_name  Last_name  Institution            Department             Address
Nancy       Dengler    University of Toronto  Botany                 25 Willocks St, Toronto, ON. M5S 3B2
Peter       Lewis      Uni. Toronto           Dept. of Biochemistry  1 Kings College Circle, Toronto, ON. M5S 1A8
John        Coleman    University of Toronto  Department of Botany   25 Willcocks St, Toronto, ON. M5S 3B2
John        Coleman    York University        Dept. of Biology       4700 Keele St, Toronto, ON. M3J 1P3

What are the problems with this sort of database?

Relational Databases offer a solution...


Relational Databases
Nancy|Dengler|Botany|University of Toronto|25 Willocks St, Toronto, ON. M5S 3B2
Peter|Lewis|Dept. of Biochemistry|Uni. Toronto|1 Kings College Circle, Toronto, ON. M5S 1A8
John|Coleman|Department of Botany|University of Toronto|25 Willcocks St, Toronto, ON. M5S 3B2
John|Coleman|Dept. of Biology|York University|4700 Keele St, Toronto, ON. M3J 1P3

A relational database consists of relations (tables) containing attributes
(fields or columns). Each row in a table is known as a tuple or a record.
Information should be normalized so that it is non-redundant: this means
that every row should be unique, although this ideal is not always observed.

Table 'Professors'

Professor_id (primary key)  First_name  Last_name  Contact_id (foreign key)
1                           Nancy       Dengler    1
2                           Peter       Lewis      2
3                           John        Coleman    1
4                           John        Coleman    3

Table 'Contacts'

Contact_id (primary key)  Institution            Department             Address
1                         University of Toronto  Dept. of Botany        25 Willocks St, Toronto, ON. M5S 3B2
2                         Uni. Toronto           Dept. of Biochemistry  1 Kings College Circle, Toronto, ON. M5S 1A8
3                         York University        Dept. of Biology       4700 Keele St, Toronto, ON. M3J 1P3
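
The two tables can be sketched with Python's built-in `sqlite3` module. This is only an illustration under assumptions: the lecture does not name a database engine, and the SQL column types are choices made for this example. The table names, column names, and rows follow the 'Professors' and 'Contacts' tables.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Contacts (
    Contact_id  INTEGER PRIMARY KEY,
    Institution TEXT,
    Department  TEXT,
    Address     TEXT
);
CREATE TABLE Professors (
    Professor_id INTEGER PRIMARY KEY,
    First_name   TEXT,
    Last_name    TEXT,
    Contact_id   INTEGER REFERENCES Contacts(Contact_id)  -- foreign key
);
""")

conn.executemany("INSERT INTO Contacts VALUES (?, ?, ?, ?)", [
    (1, "University of Toronto", "Dept. of Botany",
     "25 Willocks St, Toronto, ON. M5S 3B2"),
    (2, "Uni. Toronto", "Dept. of Biochemistry",
     "1 Kings College Circle, Toronto, ON. M5S 1A8"),
    (3, "York University", "Dept. of Biology",
     "4700 Keele St, Toronto, ON. M3J 1P3"),
])
conn.executemany("INSERT INTO Professors VALUES (?, ?, ?, ?)", [
    (1, "Nancy", "Dengler", 1),
    (2, "Peter", "Lewis", 2),
    (3, "John", "Coleman", 1),
    (4, "John", "Coleman", 3),
])

# Follow the foreign key from each professor to the matching contact row.
rows = conn.execute("""
    SELECT p.First_name, p.Last_name, c.Institution
    FROM Professors AS p
    JOIN Contacts AS c ON c.Contact_id = p.Contact_id
    ORDER BY p.Professor_id
""").fetchall()

for row in rows:
    print(row)
```

Because the Botany contact details are stored once and referenced by Contact_id, two professors can share them without the address being typed twice.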


Accession codes, identifiers, GIs, etc.


Many of the biological databases (GenBank, UNIPROT, etc.) have two (or
more!) different ways of identifying a given entry:

Identifier
Accession code (or number)


Accession codes, identifiers, GIs, etc. [2]


Identifier
An identifier ("locus" in GenBank, "entry name" in UNIPROT) is a string of
letters and digits that is understandable in some meaningful way by a human.
Identifiers are not as stable as accession numbers, mainly because they are
modified by the curators if the presumed function of the protein is found to be
something else.
UNIPROT: ADH6_HUMAN
GenBank: HUMADH6A01
An identifier can change. For example, the database curators may decide that
the identifier for an entry no longer is appropriate. This does not happen very
often.


Accession codes, identifiers, GIs, etc. [3]


Accession code (number)
An accession code (or number) is a number (with a few characters in front)
that uniquely identifies an entry. It is often assigned arbitrarily. For example, the
accession code for ADH6_HUMAN in UNIPROT is P28332.
In the case of GenBank, the accession code for the human ADH6 gene
sequence is AH001409.


Accession codes, identifiers, GIs, etc. [4]


Versions and GI numbers
In 1992, NCBI began assigning a unique number, the GenInfo Identifier (GI)
number, to each sequence submitted, even for updated versions of a
given accession (unless the two are identical). The same accession number
may be associated with a different GI if a newer or corrected sequence is
submitted.
Records typically contain the Accession.Version identifier, such as M84402.1,
in the VERSION line of the record. This identifier is mapped to its unique
corresponding GI number, which is the primary key of GenBank.
To specify a sequence exactly in GenBank, use either its GI or
Accession.Version. To retrieve the most up-to-date sequence, use the
accession number without version: the most up-to-date sequence will be
retrieved automatically.
Let's look at the GenBank record for human alcohol dehydrogenase VI:
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=178145...
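
To make the Accession.Version convention concrete, here is a small Python sketch that separates the two parts. The function name `split_accession_version` is hypothetical and chosen only for this example.

```python
def split_accession_version(identifier):
    """Split an identifier such as 'M84402.1' into (accession, version).

    A bare accession with no '.version' suffix returns None as the version,
    which corresponds to asking for the most up-to-date sequence.
    """
    accession, _, version = identifier.partition(".")
    return accession, (int(version) if version else None)


print(split_accession_version("M84402.1"))
print(split_accession_version("M84402"))
```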


GenBank Flatfile Format (GBFF)

The GenBank flatfile format (GBFF) is one of the most commonly used formats
for nucleotide sequences. It contains all of the information associated with
the sequence, as well as the sequence itself.
The GBFF has 3 parts: the header, the features, and the sequence itself.
LOCUS       HUMADH6A01     409 bp    DNA    linear    PRI 17-OCT-2000

HUMADH6A01: identifier
409 bp: length
DNA: source type
PRI: taxonomic group
17-OCT-2000: NCBI entry date
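
A minimal Python sketch of reading the fields of this LOCUS line follows. It assumes the line has exactly the eight whitespace-separated tokens shown in this record; other records may lay the line out differently, so this is not a general GenBank parser. The dictionary keys and the `parse_locus` name are invented for this example.

```python
LOCUS_LINE = "LOCUS       HUMADH6A01     409 bp    DNA    linear    PRI 17-OCT-2000"


def parse_locus(line):
    """Label the tokens of a LOCUS line laid out like the example above."""
    tokens = line.split()
    if len(tokens) != 8 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError("LOCUS line does not match the assumed layout")
    return {
        "identifier": tokens[1],       # HUMADH6A01
        "length_bp": int(tokens[2]),   # 409
        "source_type": tokens[4],      # DNA
        "topology": tokens[5],         # linear
        "taxonomic_group": tokens[6],  # PRI
        "entry_date": tokens[7],       # 17-OCT-2000
    }


locus = parse_locus(LOCUS_LINE)
print(locus)
```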

GenBank Flatfile Format Header


DEFINITION  Homo sapiens alcohol dehydrogenase 6 (ADH6) gene, exon 1.
ACCESSION   M84402 M68895
VERSION     M84402.1 GI:178137
KEYWORDS    alcohol dehydrogenase; alcohol dehydrogenase VI.
SEGMENT     1 of 8

DEFINITION: The biology of the molecule in a sentence.
ACCESSION: Accession code(s).
VERSION: Accession.Version number; the GI number is found on this line too.
KEYWORDS: Keywords as defined by the submitters (note: free text).
SEGMENT: Appears for multi-exon records. Each record is actually stored
separately, and then is bundled together on-the-fly for presentation here.


GenBank Flatfile Format Header, cont.


SOURCE      Homo sapiens.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1 (bases 1 to 409)
  AUTHORS   Yasunami,M., Chen,C.S. and Yoshida,A.
  TITLE     A human alcohol dehydrogenase gene (ADH6) encoding an additional
            class of isozyme
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 88 (17), 7610-7614 (1991)
  MEDLINE   91352038
  PUBMED    1881901

SOURCE: Contains organism name


ORGANISM: Contains complete taxonomic information from the NCBI
taxonomy server.

REFERENCE: Details on a publication about the sequence.


COMMENT: Contains misc. information and revision details.


GenBank Flatfile Format Features


FEATURES             Location/Qualifiers
     source          1..409
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /sex="male"
                     /tissue_type="liver"
     misc_signal     34..48
     exon            287..396
                     /gene="ADH6"

A direct representation of the biological information in the record.


The source feature must be present in all GenBank records, and contains
information as to where the molecule comes from (/organism="Homo sapiens")
and, potentially, map, chromosome, and tissue type information.
The exon feature tells one that the sequence from 287..396 comprises an
exon.
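
Feature locations such as 287..396 count bases from 1 and include both end points, so the exon is 396 - 287 + 1 = 110 bases long. The Python sketch below shows one way to turn such a location into a slice of the sequence. It handles only simple `start..end` locations, the helper names are hypothetical, and the toy sequence is a placeholder rather than the real record.

```python
def parse_location(location):
    """Parse a simple 'start..end' location into 1-based inclusive integers."""
    start_text, end_text = location.split("..")
    # '<' and '>' mark partial features; they are ignored in this sketch.
    return int(start_text.lstrip("<")), int(end_text.lstrip(">"))


def extract_feature(sequence, location):
    """Return the part of the sequence covered by the location."""
    start, end = parse_location(location)
    # Python slices are 0-based and exclude the end index,
    # so subtract 1 from the start and keep the end as it is.
    return sequence[start - 1:end]


toy_sequence = "a" * 409  # placeholder with the same length as the record
exon = extract_feature(toy_sequence, "287..396")
print(len(exon))
```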


GenBank Flatfile Format Features, cont.


In some records the CDS feature is present, and looks like this for X59698, the
nucleotide sequence of an EGF-receptor from mouse:

FEATURES             Location/Qualifiers
     CDS             160..>2301
                     /codon_start=1
                     /product="EGF-receptor"
                     /protein_id="CAA42219.1"
                     /db_xref="GI:50804"
                     /db_xref="MGD:95294"
                     /db_xref="SWISS-PROT:Q01279"
                     /translation="MRPSGTARTTLLVLLTALCAAGGALEEKKVCQGTSNRLTQLGTF
                     EDHFLSLQRMYNNCEVVLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLE
                     NLQIIRGNALYENTYALAILSNYGTNRTGLRELPMRNLQEILIGAVRFSNNPILCNMD
                     TIQWRDIVQNVFMSNMSMDLQSHPSSCPKCDPSCPNGSCWGGGEENCQKLTKIICAQQ
                     CSHRCRGRSPSDCCHNQCAAGCTGPRESDCLVCQKFQDEATCKDTCPPLMLYNPTTYQ
                     MDVNPEGKYSFGATCVKKCPRNYVVTDHGSCVRACGPDYYEVEEDGIRKCKKCDGPCR
                     KVCNGIGIGEFKDTLSINATNIKHFKYCTAISGDLHILPVAFKGDSFTRTPPLDPREL
                     EILKTVKEITGFLLIQAWPDNWTDLHAFENLEIIRGRTKQHGQFSLAVVGLNITSLGL
                     RSLKEISDGDVIISGNRNLCYANTINWKKLFGTPNQKTKIMNNRAEKDCKAVNHVCNP
                     LCSSEGCWGPEPRDCVSCQNVSRGRECVEKWNILEGEPREFVENSECIQCHPECLPQA
                     MNITCTGRGPDNCIQCAHYIDGPHCVKTCPAGIMGENNTLVWKYADANNVCHLCHANC
                     TYGCAGPGLQGCEVWPSGPKIPSIATGIVGGLLFIVVVALGIGLFMRRRHIVRKRTLR
                     RLLQERELVEPLTPSGEAPNQAHLRILKETEF"
     sig_peptide     160..231
     mat_peptide     232..>2301
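
The coordinates in this record can be checked with a little arithmetic: the signal peptide spans 231 - 160 + 1 = 72 nucleotides, which is 24 codons, and the signal peptide and mature peptide together cover the same bases as the CDS. The Python sketch below works this through, assuming (as /codon_start=1 and the shared start position 160 suggest) that the signal peptide corresponds to the first residues of the translation. The helper name `span_length` is made up for this example.

```python
def span_length(location):
    """Length in nucleotides of a simple 'start..end' location (1-based, inclusive)."""
    start, end = (int(part.lstrip("<>")) for part in location.split(".."))
    return end - start + 1


# First line of the /translation qualifier in the record above.
TRANSLATION_START = "MRPSGTARTTLLVLLTALCAAGGALEEKKVCQGTSNRLTQLGTF"

cds_nt = span_length("160..>2301")
sig_nt = span_length("160..231")
mat_nt = span_length("232..>2301")

# Three nucleotides per codon, one amino acid per codon.
sig_aa = sig_nt // 3
signal_peptide = TRANSLATION_START[:sig_aa]

print(cds_nt, sig_nt, mat_nt)
print(sig_aa, signal_peptide)
```

The `>` in 160..>2301 marks the feature as extending beyond the end of the sequenced region, so these lengths describe only the part of the coding sequence present in this record.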


GenBank Flatfile Format Sequence


The last part of the GenBank flat file record is the sequence itself. A summary
of nucleotide composition is presented, and the sequence follows.
BASE COUNT      133 a     75 c     77 g    124 t
ORIGIN
        1 tgtattttga aaacaacaga aaagaaatac ttttgtacac tctgttagaa attttaagtt
       61 tggacattta aaagtccaaa tttaaaactc aaaaaaatgg ataataagag ggacctgttt
      121 gattaaggga gaaaaaaata gtttgcattt tcaccttttg gctctttcac tgagatgagc
      181 ctatttcaga ttacacttag gaacttccat caagcacggg agagcctact tttcctgttt
      241 aataattacc agactacaga gaaggtcgga ccagccttct gatctacagt cgcctgtgta
      301 cctttgtact ttctacagtg aaagttgcta caggatctcc ctttctcaat aaattcatct
      361 gcggtggaga aaatcagcat gagtactaca ggccaagtag gtgcagtat
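
The BASE COUNT line is simply a tally of each nucleotide in the sequence; in this record 133 + 75 + 77 + 124 = 409, matching the length given on the LOCUS line. As a minimal Python sketch, the same kind of tally can be computed for any stretch of an ORIGIN block. Only the first two ten-base groups of the record are used here, and the function name `base_count` is an assumption of this example.

```python
from collections import Counter


def base_count(origin_text):
    """Count bases in ORIGIN-style text, ignoring position numbers and spaces."""
    bases = [ch for ch in origin_text.lower() if ch.isalpha()]
    return Counter(bases)


# Start of the first ORIGIN line, including its position number.
counts = base_count("1 tgtattttga aaacaacaga")
print(dict(counts))
print(sum(counts.values()))
```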

Nucleotide Databases: Growth of GenBank

[Figure: growth of GenBank, with the years 2005 and 2008 marked]
from http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

Searching GenBank + other sequence DBs


by keyword
by sequence similarity, using BLAST* (http://www.ncbi.nlm.nih.gov/BLAST/)

*Google doesn't handle sequence searches well: it can't put in gaps
to identify partial matches to similar sequences, and it doesn't know which
amino acids have similar properties!


Definitions

Gene duplication results in two copies in a common ancestor of frog, chick, and mouse
Just one copy of globin in ancient organism
from http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html


Searching across DBs: the GQuery/Entrez and SRS systems.


Several publicly available tools exist for querying across datasets. Two
of these are good starting points. One is provided by the NCBI and is called
GQuery/Entrez (http://www.ncbi.nlm.nih.gov/gquery/), and the other is provided
by EBI and is called SRS (Sequence Retrieval System, http://srs.ebi.ac.uk/).
GQuery/Entrez provides links between many of the databases at NCBI.
We'll go through an example using GQuery/Entrez...


Sample Problem
Identify the SNPs which potentially cause early onset breast cancer, and design
oligos to PCR them in samples of human genomic DNA for sequencing. Use the
OMIM function of GQuery/Entrez. OMIM provides links to everything that is
known about a given disease across the various databases at NCBI.
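
The lecture works through this problem in the web interface. For readers who prefer a script, the sketch below builds a search URL for NCBI's E-utilities ESearch service, which is not covered in the lecture and is brought in here as an assumption. The database name "omim", the search term, and the `build_esearch_url` helper are illustrative choices; the code only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service (not part of the lecture).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, term, retmax=20):
    """Build an ESearch URL for a keyword search in one Entrez database."""
    query = urlencode({"db": database, "term": term, "retmax": retmax})
    return f"{EUTILS_ESEARCH}?{query}"


url = build_esearch_url("omim", "early onset breast cancer")
print(url)
```

Opening the resulting URL in a browser should return a list of matching record identifiers, which would then be followed up in the linked databases much as in the web-based walkthrough.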

http://www.ncbi.nih.gov/Database/datamodel/index.html


Primer3 can then be used to design PCR primers


Use it at http://frodo.wi.mit.edu/

Steve Rozen and Helen J. Skaletsky (2000), in: Krawetz S, Misener S (eds) Bioinformatics Methods and Protocols: Methods in
Molecular Biology. Humana Press, Totowa, NJ, pp 365-386

