Biology
Please use the Coursera tools to discuss lecture content and labs.
Course material developed by Ryan Austin, David Guttman, Laura Hug, Momoko Price, and Nicholas Provart
Course produced by Jamie Waese, Rohan Patel, William Heikoop, and Nicholas Provart
This Coursera course will cover the basics of searching one of the main
repositories of sequence information, NCBI’s GenBank, using GQuery/Entrez and
Blast, along with creating sequence alignments and phylogenies. Selection
analysis will also be covered, as will next generation sequence analysis and
metagenomics. Most tools used for exploration are web-based.
Week Topic
1 NCBI/Blast I
2 Blast II/Comparative Genomics
3 Multiple Sequence Alignments
4 Phylogenetics
5 Selection Analysis
6 NGS Analysis / Metagenomics
What is bioinformatics?
Bioinformatics
• is the development and application of computational tools in managing all
kinds of biological data
• involves the technology that uses computers for storage, retrieval,
manipulation, and distribution of information related to biological
macromolecules such as DNA, RNA, proteins and metabolites
• generally limited to sequence, structural, and functional analysis of genes
and genomes and their corresponding products
• sometimes called computational molecular biology
This field has developed over the past decade or so to help manage the
huge increase in data generated by genome sequencing projects,
high-throughput technologies, etc.
Why bioinformatics?
>gi|27500381:c623297-542205 Homo sapiens chromosome 17 genomic contig
AAAACTGCGACTGCGCGGCGTGAGCTCGCTGAGACTTCCTGGACGGGGGACAGGCTGTGGGGTTTCTCAG
ATAACTGGGCCCCTGCGCTCAGGAGGCCTTCACCCTCTGCTCTGGGTAAAGGTAGTAGAGTCCCGGGAAA
GGGACAGGGGGCCCAAGTGATGCTCTGGGGTACTGGCGTGGGAGAGTGGATTTCCGAAGCTGACAGATGG
GTATTCTTTGACGGGGGGTAGGGGCGGAACCTGAGAGGCGTAAGGCGTTGTGAACCCTGGGGAGGGGGGC
AGTTTGTAGGTCGCGAGGGAAGCGCTGAGGATCAGGAAGGGGGCACTGAGTGTCCGTGGGGGAATCCTCG
TGATAGGAACTGGAATATGCCTTGAGGGGGACACTATGTCTTTAAAAACGTCGGCTGGTCATGAGGTCAG
GAGTTCCAGACCAGCCTGACCAACGTGGTGAAACTCCGTCTCTACTAAAAATACAAAAATTAGCCGGGCG
TGGTGCCGCTCCAGCTACTCAGGAGGCTGAGGCAGGAGAATCGCTAGAACCCGGGAGGCGGAGGTTGCAG
TGAGCCGAGATCGCGCCATTGCACTCCAGCCTGGGCGACAGAGCGAGACTGTCTCAAAACAAAACAAAAC
AAAACAAAACAAAAAACACCGGCTGGTATGTATGAGAGGATGGGACCTTGTGGAAGAAGAGGTGCCAGGA
ATATGTCTGGGAAGGGGAGGAGACAGGATTTTGTGGGAGGGAGAACTTAAGAACTGGATCCATTTGCGCC
ATTGAGAAAGCGCAAGAGGGAAGTAGAGGAGCGTCAGTAGTAACAGATGCTGCCGGCAGGGATGTGCTTG
AGGAGGATCCAGAGATGAGAGCAGGTCACTGGGAAAGGTTAGGGGCGGGGAGGCCTTGATTGGTGTTGGT
TTGGTCGTTGTTGATTTTGGTTTTATGCAAGAAAAAGAAAACAACCAGAAACATTGGAGAAAGCTAAGGC
TACCACCACCTACCCGGTCAGTCACTCCTCTGTAGCTTTCTCTTTCTTGGAGAAAGGAAAAGACCCAAGG
GGTTGGCAGCAATATGTGAAAAAATTCAGAATTTATGTTGTCTAATTACAAAAAGCAACTTCTAGAATCT
TTAAAAATAAAGGACGTTGTCATTAGTTCTTTGGTTTGTATTATTCTAAAACCTTCCAAATCTTAAATTT
ACTTTATTTTAAAATGATAAAATGAAGTTGTCATTTTATAAACCTTTTAAAAAGATATATATATATGTTT
TTCTAATGTGTTAAAGTTCATTGGAACAGAAAGAAATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACA
AAATGTCATTAATGCTATGCAGAAAATCTTAGAGTGTCCCATCTGGTAAGTCAGCACAAGAGTGTATTAA
TTTGGGATTCCTATGATTATCTCCTATGCAAATGAACAGAATTGACCTTACATACTAGGGAAGAAAAGAC
ATGTCTAGTAAGATTAGGCTATTGTAATTGCTGATTTCCTTAACTGAAGAACTTTAAAAATATAGAAAAT
GATTCCTTGTTCTCCATCCACTCTGCCTCTCCCACTCCTCTCCTTTTCAACACAAATCCTGTGGTCCGGG
AAAGACAGGGACTCTGTCTTGATTGGTTCTGCACTGGGGCAGGAATCTAGTTTAGATTAACTGGCATTTT
GGCTTTTCTTCCAGCTCTAAAACAAGCTCCATCACTTGAAATGGCAAAATAAAATCATGGATGAGGCCGA
GGGCGGTGGCTTATGCCTGTAATCCCAGCACTTTGGGAGGCCAAGGTGGTAGGATCACGAGGTCAGGAGA
TCGAGACCATCCTGGCCAACATGGTGAAACCCCCTCTCCACTAAAAATACAAAAATTAGCTGGGCGTAGT
Biological Databases
Outline
• Why databases?
• What is a database?
• Data structures: Flat File and Relational
• Accession numbers and identifiers
• A practical example of utility – GQuery/Entrez
Why databases?
What is a database?
Nancy Dengler University of Toronto Botany 25 Willocks St, Toronto, ON. M5S 3B2
Peter Lewis Uni. Toronto Dept. of Biochemistry 1 King’s College Circle, Toronto, ON. M5S 1A8
John Coleman University of Toronto Department of Botany 25 Willcocks St, Toronto, ON. M5S 3B2
John Coleman York University Dept. of Biology 4700 Keele St, Toronto, ON. M3J 1P3
Relational Databases
Nancy|Dengler|Botany|University of Toronto|25 Willocks St, Toronto, ON. M5S 3B2
Peter|Lewis|Dept. of Biochemistry|Uni. Toronto|1 King’s College Circle, Toronto, ON. M5S 1A8
John|Coleman|Department of Botany|University of Toronto|25 Willcocks St, Toronto, ON. M5S 3B2
John|Coleman|Dept. of Biology|York University|4700 Keele St, Toronto, ON. M3J 1P3
primary key
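The records above can be loaded into a relational table to show what a primary key buys. This is a minimal sketch using Python's built-in sqlite3 module. The table name `people`, the column names, and the integer `id` key are assumptions made for illustration and do not come from the slides.

```python
import sqlite3

# Pipe-delimited records from the slide, in the order
# first|last|department|institution|address.
records = """\
Nancy|Dengler|Botany|University of Toronto|25 Willocks St, Toronto, ON. M5S 3B2
Peter|Lewis|Dept. of Biochemistry|Uni. Toronto|1 King’s College Circle, Toronto, ON. M5S 1A8
John|Coleman|Department of Botany|University of Toronto|25 Willcocks St, Toronto, ON. M5S 3B2
John|Coleman|Dept. of Biology|York University|4700 Keele St, Toronto, ON. M3J 1P3
"""

conn = sqlite3.connect(":memory:")
# "id" is a hypothetical surrogate primary key: unique per row, unlike a name.
conn.execute(
    "CREATE TABLE people ("
    "id INTEGER PRIMARY KEY, first TEXT, last TEXT, "
    "department TEXT, institution TEXT, address TEXT)"
)
rows = [line.split("|") for line in records.splitlines()]
conn.executemany(
    "INSERT INTO people (first, last, department, institution, address) "
    "VALUES (?, ?, ?, ?, ?)",
    rows,
)

# A name is not a safe key: two different entries share it.
same_name = conn.execute(
    "SELECT id, institution FROM people "
    "WHERE first = ? AND last = ? ORDER BY id",
    ("John", "Coleman"),
).fetchall()

# The primary key identifies exactly one row.
one_row = conn.execute(
    "SELECT first, last, institution FROM people WHERE id = ?", (4,)
).fetchone()

print(same_name)
print(one_row)
```

Searching by name returns two rows, while searching by `id` returns exactly one. Accession codes play the same role in sequence databases.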
Many of the biological databases (GenBank, UNIPROT etc.) have two (or
more!) different ways of identifying a given entry:
• Identifier
• Accession code (or number)
Accession codes, identifiers, GIs, etc. [2]
Identifier
Identifiers are not as stable as accession numbers, mainly because they are
modified by the curators if the presumed function of the protein is found to be
something else.
UNIPROT: ADH6_HUMAN
GenBank: HUMADH6A01
An identifier can change. For example, the database curators may decide that
the identifier for an entry is no longer appropriate. This does not happen very
often.
Accession codes, identifiers, GIs, etc. [4]
The GenBank flatfile format (GBFF) is one of the most commonly used formats
for nucleotide sequences. It contains all of the information associated with
the sequence, as well as the sequence itself.
The GBFF has 3 parts: the header, the features, and the sequence itself.
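The three-part layout can be sketched as a small splitter. This is a minimal sketch, assuming the features section starts at the FEATURES line, the sequence section starts at the ORIGIN line (or at BASE COUNT in older records), and the record ends with "//". The function name and the toy record are illustrative only, and the sequence line in the toy record is a placeholder, not the real M84402 sequence.

```python
def split_genbank_record(text):
    """Split one GenBank flat file record into (header, features, sequence) line lists."""
    header, features, sequence = [], [], []
    current = header
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            current = features
        elif line.startswith(("BASE COUNT", "ORIGIN")):
            current = sequence
        elif line.startswith("//"):
            break  # end-of-record marker
        current.append(line)
    return header, features, sequence


# Toy record for illustration: header and feature lines follow the slides,
# the sequence line is a placeholder.
toy_record = """\
DEFINITION  Homo sapiens alcohol dehydrogenase 6 (ADH6) gene, exon 1.
ACCESSION   M84402 M68895
VERSION     M84402.1  GI:178137
FEATURES             Location/Qualifiers
     source          1..409
                     /organism="Homo sapiens"
ORIGIN
        1 aaaactgcga ctgcgcggcg
//
"""

header, features, sequence = split_genbank_record(toy_record)
print(len(header), len(features), len(sequence))
```

For real work, an established parser is the safer choice. This sketch only shows where the boundaries between the three parts fall.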
GenBank Flatfile Format – Header
DEFINITION Homo sapiens alcohol dehydrogenase 6 (ADH6) gene, exon 1.
ACCESSION M84402 M68895
VERSION M84402.1 GI:178137
KEYWORDS alcohol dehydrogenase; alcohol dehydrogenase VI.
SEGMENT 1 of 8
• ACCESSION: Code(s)
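The header fields above can be pulled apart with a few string operations. This is a minimal sketch, assuming each field sits on a single line that begins with its keyword (continuation lines are not handled). The function and variable names are illustrative. By GenBank convention, the first code on the ACCESSION line is the primary accession and any others are secondary.

```python
# Header lines as shown on the slide.
header_lines = [
    "DEFINITION Homo sapiens alcohol dehydrogenase 6 (ADH6) gene, exon 1.",
    "ACCESSION M84402 M68895",
    "VERSION M84402.1 GI:178137",
    "KEYWORDS alcohol dehydrogenase; alcohol dehydrogenase VI.",
    "SEGMENT 1 of 8",
]


def parse_header(lines):
    """Map each leading keyword to the rest of its line."""
    fields = {}
    for line in lines:
        keyword, _, value = line.partition(" ")
        fields[keyword] = value.strip()
    return fields


fields = parse_header(header_lines)

# ACCESSION: primary code first, then any secondary codes.
accessions = fields["ACCESSION"].split()
primary_accession, secondary_accessions = accessions[0], accessions[1:]

# VERSION: "accession.version" followed by the GI number.
versioned, gi_field = fields["VERSION"].split()
accession, version = versioned.split(".")
gi = int(gi_field.split(":")[1])

print(primary_accession, secondary_accessions, version, gi)
```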
GenBank Flatfile Format – Features
FEATURES Location/Qualifiers
source 1..409
/organism="Homo sapiens"
/db_xref="taxon:9606"
/sex="male"
/tissue_type="liver"
misc_signal 34..48
exon 287..396
/gene="ADH6"
In some records the CDS feature is present, and looks like this for X59698, the
nucleotide sequence of an EGF-receptor from mouse:
FEATURES Location/Qualifiers
CDS 160..>2301
/codon_start=1
/product="EGF-receptor"
/protein_id="CAA42219.1"
/db_xref="GI:50804"
/db_xref="MGD:95294"
/db_xref="SWISS-PROT:Q01279"
/translation="MRPSGTARTTLLVLLTALCAAGGALEEKKVCQGTSNRLTQLGTF
EDHFLSLQRMYNNCEVVLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLE
NLQIIRGNALYENTYALAILSNYGTNRTGLRELPMRNLQEILIGAVRFSNNPILCNMD
TIQWRDIVQNVFMSNMSMDLQSHPSSCPKCDPSCPNGSCWGGGEENCQKLTKIICAQQ
CSHRCRGRSPSDCCHNQCAAGCTGPRESDCLVCQKFQDEATCKDTCPPLMLYNPTTYQ
MDVNPEGKYSFGATCVKKCPRNYVVTDHGSCVRACGPDYYEVEEDGIRKCKKCDGPCR
KVCNGIGIGEFKDTLSINATNIKHFKYCTAISGDLHILPVAFKGDSFTRTPPLDPREL
EILKTVKEITGFLLIQAWPDNWTDLHAFENLEIIRGRTKQHGQFSLAVVGLNITSLGL
RSLKEISDGDVIISGNRNLCYANTINWKKLFGTPNQKTKIMNNRAEKDCKAVNHVCNP
LCSSEGCWGPEPRDCVSCQNVSRGRECVEKWNILEGEPREFVENSECIQCHPECLPQA
MNITCTGRGPDNCIQCAHYIDGPHCVKTCPAGIMGENNTLVWKYADANNVCHLCHANC
TYGCAGPGLQGCEVWPSGPKIPSIATGIVGGLLFIVVVALGIGLFMRRRHIVRKRTLR
RLLQERELVEPLTPSGEAPNQAHLRILKETEF"
sig_peptide 160..231
mat_peptide 232..>2301
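The feature locations above are plain coordinate ranges, so their lengths can be checked directly. This is a minimal sketch that handles only the simple "start..end" form, where "<" or ">" marks an end that extends beyond the record (a partial feature). The function name is illustrative, and join() or complement() locations are not handled.

```python
import re


def parse_location(location):
    """Parse a simple 'start..end' feature location; '<' or '>' marks a partial end."""
    match = re.fullmatch(r"(<?)(\d+)\.\.(>?)(\d+)", location)
    if match is None:
        raise ValueError(f"unsupported location: {location}")
    start_partial, start, end_partial, end = match.groups()
    return {
        "start": int(start),
        "end": int(end),
        "partial": bool(start_partial or end_partial),
        "length": int(end) - int(start) + 1,  # coordinates are 1-based and inclusive
    }


cds = parse_location("160..>2301")
sig = parse_location("160..231")
mat = parse_location("232..>2301")

# The signal peptide spans whole codons, so its nucleotide length divided by 3
# gives the number of leading residues of the /translation that belong to it.
translation_start = "MRPSGTARTTLLVLLTALCAAGGALEEKKVCQ"  # start of /translation on the slide
sig_residues = translation_start[: sig["length"] // 3]

print(cds, sig, mat, sig_residues)
```

The signal peptide and mature peptide lengths add up to the CDS length, as expected for features that tile the coding region.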
GenBank Flatfile Format – Sequence
The last part of the GenBank flat file record is the sequence itself. A summary
of nucleotide composition is presented, and the sequence follows.
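The nucleotide composition summary is simply a count of each base. A minimal sketch, with an illustrative function name, applied to the first line of the chromosome 17 excerpt shown earlier:

```python
from collections import Counter


def base_composition(sequence):
    """Count A, C, G and T in a nucleotide sequence, ignoring case."""
    counts = Counter(sequence.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}


# First line of the chromosome 17 contig excerpt from the earlier slide.
first_line = "AAAACTGCGACTGCGCGGCGTGAGCTCGCTGAGACTTCCTGGACGGGGGACAGGCTGTGGGGTTTCTCAG"
composition = base_composition(first_line)
print(composition)
```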
Searching GenBank + other sequence DBs
by keyword
by sequence similarity, using BLAST* (http://www.ncbi.nlm.nih.gov/BLAST/)
Definitions
Searching across DBs: the Entrez and SRS systems.
Sample Problem
Identify the SNPs which potentially cause early onset breast cancer, and design
oligos to PCR them in samples of human genomic DNA for sequencing. Use the
OMIM “function” of Entrez/GQuery. OMIM has links to everything that is known
about a given disease across the various databases at NCBI.
http://www.ncbi.nih.gov/Database/datamodel/index.html
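The slides use the Entrez web interface, but the same keyword search can be expressed as a URL for NCBI's E-utilities. This is a minimal sketch that only builds the query URL and does not contact NCBI. The search term, the `retmax` value, and the function name are assumptions chosen for illustration, not part of the course material.

```python
from urllib.parse import urlencode

# Base address of the E-utilities "esearch" service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, term, retmax=20):
    """Build an esearch URL for a keyword query against one Entrez database."""
    params = {"db": database, "term": term, "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Hypothetical query for the sample problem: search OMIM by keyword.
url = build_esearch_url("omim", "breast cancer early onset")
print(url)
```

Opening the resulting URL returns the identifiers of matching records, which can then be followed to the linked databases, much as the web interface does.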
Sample Problem [2]
Sample Problem [5]
Primer3 can then be used to design PCR primers
Use it at http://frodo.wi.mit.edu/
Steve Rozen and Helen J. Skaletsky (2000), in: Krawetz S, Misener S (eds) Bioinformatics Methods and Protocols: Methods in
Molecular Biology. Humana Press, Totowa, NJ, pp 365-386
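Two properties routinely checked for a candidate primer are its GC content and its melting temperature. This is a minimal sketch using the simple Wallace rule, Tm = 2(A+T) + 4(G+C), a rough estimate for short oligos. It is not the method Primer3 uses, which relies on a more detailed thermodynamic model. The candidate below is just the first 20 bases of the chromosome 17 excerpt, not a designed primer.

```python
def gc_content(primer):
    """Fraction of G and C bases in the primer."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough melting temperature in degrees C: 2 per A/T plus 4 per G/C."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


# First 20 bases of the chromosome 17 excerpt, used only as an example oligo.
candidate = "AAAACTGCGACTGCGCGGCG"
print(gc_content(candidate), wallace_tm(candidate))
```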