
Exer 5 - BIOINFORMATICS


EXERCISE NO. ___A

Bioinformatics: Name that Gene

INTRODUCTION

Bioinformatics is a field that combines statistics, mathematical modeling, and computer science to analyze biological data. Using
bioinformatics methods, entire genomes can be quickly compared to detect genetic similarities and differences. The need for Bioinformatics
has arisen from the recent explosion of publicly available genomic information, such as that resulting from the Human Genome Project. To
address this, the National Center for Biotechnology Information (NCBI) was established in 1988 as a national resource for molecular
biology information. The NCBI creates public-access databases, develops software tools for analyzing genome data, and disseminates
biomedical information - all for a better understanding of molecular processes affecting human health and disease. The NCBI is a virtual
goldmine both in terms of available resources, and treasures yet to be discovered.

NCBI hosts GenBank, the National Institutes of Health (NIH) DNA sequence database, and is well known for its bioinformatics
software BLAST, which stands for Basic Local Alignment Search Tool. Using BLAST, you can input a gene sequence of interest and
search entire genomic libraries for identical or similar sequences in a matter of seconds. If the nucleotide sequence is published, the BLAST
search results will give the name of the gene containing the submitted sequence, as well as the organism it belongs to. Furthermore, the
search results list all the close matches to the submitted sequence in ranked order (best match first).
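
To make the idea of a "local alignment" concrete, here is a minimal Python sketch of the Smith-Waterman local alignment score. This is not how BLAST is implemented: BLAST uses a faster heuristic search rather than this exhaustive dynamic programming. The scoring values (match +2, mismatch -1, gap -2) are assumptions chosen only for illustration.

```python
def local_alignment_score(query, subject, match=2, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score of two sequences."""
    # Previous row of the dynamic-programming table; row 0 is all zeros.
    prev = [0] * (len(subject) + 1)
    best = 0
    for q in query:
        curr = [0]  # column 0 is always zero
        for j, s in enumerate(subject, start=1):
            diag = prev[j - 1] + (match if q == s else mismatch)
            up = prev[j] + gap        # gap in the subject
            left = curr[j - 1] + gap  # gap in the query
            # A local alignment may restart anywhere, so scores never drop below 0.
            cell = max(0, diag, up, left)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# A short query found inside a longer, otherwise unrelated sequence.
print(local_alignment_score("GGGACGTGGG", "TTACGTTT"))
```

A higher score means a longer or cleaner shared region, which is the intuition behind ranking database hits from best match to worst.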

Learning Outcomes

At the end of the activity, you should be able to:

1. Learn and understand the ways in which the NCBI online database classifies and organizes information on DNA sequences, evolutionary
relationships, and scientific publications.

2. Use and apply the NCBI search tool BLAST to an unknown nucleotide sequence.

3. Interpret and analyze the output of the NCBI search tool BLAST.

Activating Prior Knowledge

Reflect on the three learning outcomes above and complete the table below:

Learning Outcomes | What do you know? | Any questions/clarifications about learning outcomes
1
2
3

Learning Activity 1

Read the procedures below while keeping in mind the learning outcomes. While reading and
performing the activity, be guided by the following questions and answer them afterward.

Guide Questions:

1. In a BLAST search, how do you know which DNA sequence was closest to your unknown sequence?

2. If you had more nucleotides in your sequence to enter into BLAST (say 1000 instead of 100), do you think you
would get more specific or less specific matches? Why?

3. Why do scientists all over the world check BLAST when they sequence a new region of DNA?

4. Why would a biologist studying human genetic disorders want to see related genetic entries for mice, cats, dogs
etc.?

Note: You will need a computer with an internet connection.


In this exercise, you will be given a nucleotide sequence found in real human DNA that is associated with a genetic disease
when mutated. Your job is to compare the sequences you are given with the nucleotide sequence of most known genes, using
the BLAST tool to search genetic databases.

1. Go to the homepage of the NCBI (www.ncbi.nlm.nih.gov )

2. Click on the word "BLAST" located in the list of links on the right of the page.

3. Scroll down until you find the heading "Nucleotide BLAST" and click the link.

4. This time you will practice using a part of the gene sequence that codes for hemoglobin. When a mutation occurs in
this gene, a person can develop Sickle Cell Anemia. Copy and paste the sequence below into the BLAST window.

GGG ATG AAT AAG GCA TAT GCA TCA GGG GCT GTT GCC AAT GTG CAT TAG CTG TTT GCA GCC TCA
CCT TCT TTC ATG GAG TTT AAG ATA
5. When you have finished entering your sequence, click the button for “Others” and then click “BLAST”.

6. You will then see a screen asking you to wait 10-20 seconds. Don’t click on anything – relax and wait
patiently for the search to conclude.
7. After the search has ended, scroll down past the box and find the words "Sequences Producing
Significant Alignment".
8. The closest matches to your DNA sequence are listed in order. You should notice that the BLAST search
gives you the results for all genomes currently mapped (several prokaryotes, humans, rats, chimpanzees,
cows, pigs, chickens and puffer fish, to name a few).
Please take a moment to be awed by the similarity found in the DNA code despite the outward physical diversity of organisms.

9. Click the blue Accession number to the far right of the first human listing to enter the gene information
page. Choose “(HBB) gene, complete cds”. This will tell you the name of the gene and its abbreviation.
10. You should now be on a screen that has the following information. From here, you could find that
HBB is the official symbol for this gene.

11. From the drop-down menu, choose Gene and type HBB[sym]. Choose the first search result.

12. You will then be directed to the Gene Summary. If you read further, it will tell you that it is found
in humans and that a mutated version of it causes Sickle Cell Anemia.
13. Scroll down and you will find more information about the gene including its Location (under Genome
Context). Next click on the link to the “Genome Data Viewer” (open in new tab). While you are here,
take a moment to explore the chromosomal nomenclature.

14. Across the top of the page, you will notice the numbers 1-22 and the letters X and Y. These represent the chromosomes found in
humans. The location 11p15.5 means that the HBB gene is found on chromosome 11, on the short (p) arm, in region 15.5 (locus).
15. Your next step will be to click on the OMIM link. This will take you to a page with a lot of information
about your gene and what it does.

So, that is your basic tour of the NCBI.

16. Using the same steps as before, you are going to explore the human genome. Each student will be assigned two
gene sequences: students whose family name starts with A to G (seqs 1 and 2); H to M (seqs 3 and 4); N to S
(seqs 5 and 6); and T to Z (seqs 7 and 8).
17. Fill out the attached data sheets for both gene sequences.
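
The chromosomal location notation from step 14 (for example 11p15.5) can be split into its parts mechanically. The following is a minimal Python sketch under stated assumptions: the function name `parse_location` and the regular expression are illustrative, not part of any NCBI tool, and the pattern only handles simple single-location strings for human chromosomes.

```python
import re

# Simple pattern for locations such as "11p15.5" or "Xq28":
# chromosome (1-22, X or Y), arm (p = short, q = long), then the region number.
LOCATION = re.compile(
    r"^(?P<chrom>[1-9]|1[0-9]|2[0-2]|X|Y)(?P<arm>[pq])(?P<region>\d+(?:\.\d+)?)$"
)


def parse_location(location):
    """Split a cytogenetic location string into chromosome, arm and region."""
    m = LOCATION.match(location.strip())
    if m is None:
        raise ValueError(f"not a simple cytogenetic location: {location!r}")
    arm = "short (p)" if m["arm"] == "p" else "long (q)"
    return {"chromosome": m["chrom"], "arm": arm, "region": m["region"]}


print(parse_location("11p15.5"))
```

This can help when filling out the "Chromosome number (location of gene)" entry of the data sheet, since the first component is the chromosome itself.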

Gene Sequences: You will be assigned two of the following sequences to BLAST (follow the instructions in step 16 for the assignment
of sequences).

Gene Sequence 1
ATG GCG ACC CTG GAA AAA GCT GAT GAA GGC CTT CGA GTC CCT CAA GTC CTT CCA GCA GCA GCA GCA GCA
GCA GCA GCA GCA GCA GCA GCA GCA GCA GCA GCA GC

Gene Sequence 2
ATG GCG GGT CTG ACG GCG GCG GCC CCG CGG CCC GGA GTC CTC CTG CTC CTG CTG TCC ATC CTC CAC CCC
TCT CGG CCT GGA GGG GTC CCT GGG GCC ATT CCT GGT GGA GTT CCT GGA GGA GTC TT

Gene Sequence 3
ATG CTC ACA TTC ATG GCC TCT GAC AGC GAG GAA GAA GTG TGT GAT GAG CGG ACG TCC CTA ATG TCG GCC
GAG AGC CCC AGC CCG CGC TCC TGC CAG GAG GGC AGG CAG GGC CCA GAG GAT GGA G

Gene Sequence 4
ATG TTT TAT ACA GGT GTA GCC TGT AAG AGA TGA AGC CTG GTA TTT ATA GAA ATT GAC TTA TTT TAT TCT CAT
ATT TAC ATG TGC ATA ATT TTC CAT ATG CCA GAA AAG TTG AAT AGT ATC AGA TTC CAA ATC T

Gene Sequence 5
ATG CGT CGA GGG CGT CTG CTG GAG ATC GCC CTG GGA TTT ACC GTG CTT TTA GCG TCC TAC ACG AGC CAT
GGG GCG GAC GCC AAT TTG GAG GCT GGG AAC GTG AAG GAA ACC AGA GCC AGT CGG GCC

Gene Sequence 6
ATG CCG CCC AAA ACC CCC CGA AAA ACG GCC GCC ACC GCC GCC GCT GCC GCC GCG GAA CCC GGC ACC GCC
GCC GCC GCC CCC TCC TGA GGG ACC CAG AGC AGG ACA GCG GCC CGG AGG AC

Gene Sequence 7
ATG TTG TGC AAT ATC CAT CTA CTG TAG TTA AGA TAT TCA GTA GTT TGT TTT TCA TAA GCA TGT AAT TGA TCA TAT
TTC TGC CAA GGA TGT GCC TTC AAC TTT ATA ATT ATA GTG TTG TAA AAT ATT TTT GTC TG

Gene Sequence 8
ATG CCA TCT TCC TTG ATG TTG GAG GTA CCT GCT CTG GCA GAT TTC AAC CGG GCT TGG ACA GAA CTT ACC GAC
TGG CTT TCT CTG CTT GAT CAA GTT ATA AAA TCA CAG AGG GTG ATG GTG GGT GAC CTT
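
The sequences above are printed with spaces between codons and are split across lines. If you want to store one as a clean FASTA record before pasting it into a tool, the following Python sketch shows one way to do it. The function name `to_fasta` and the 60-character line width are assumptions made for illustration; some tools may tolerate the spaces as they are.

```python
def to_fasta(name, spaced_sequence, width=60):
    """Collapse whitespace, check the letters, and wrap as a FASTA record."""
    # Remove all spaces and line breaks, and normalise to upper case.
    sequence = "".join(spaced_sequence.split()).upper()
    # Only the four DNA bases plus N (unknown base) are accepted in this sketch.
    invalid = set(sequence) - set("ACGTN")
    if invalid:
        raise ValueError(f"unexpected characters: {sorted(invalid)}")
    # Break the sequence into fixed-width lines under a single header line.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f">{name}"] + lines)


# The first few codons of a sequence, typed exactly as printed on the page.
print(to_fasta("example", "ATG GCG ACC CTG GAA\nAAA GCT GAT"))
```

The validation step is useful because a stray character introduced while copying from a PDF is a common source of failed or misleading searches.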

Reminder: Before you proceed, go back and complete Learning Activity 1 above.
Assessment Activity

This is a major assessment activity that requires you to do the following.

1. Follow the instructions in steps 16 and 17 and answer the question below.

2. After identifying the unknown sequences assigned to you, fill out the table.

1. Gene sequence #
a. Gene Sequence Number

b. Abbreviation or name of gene

c. Chromosome number (location of gene)

d. Genetic disease associated with gene

e. Description of the disease associated with the gene

2. Gene sequence #
a. Gene Sequence Number

b. Abbreviation or name of gene

c. Chromosome number (location of gene)

d. Genetic disease associated with gene

e. Description of the disease associated with the gene

Question:

1. How does the information in the NCBI help to illustrate the interrelation of all organisms? Explain.
Reflecting on your Learning

After engaging in all learning and assessment activities, reflect on your learning. Which of
the two learning outcomes have you achieved?

Learning Outcomes | What are your key learnings/highlights of your learning?

1. Learn and understand the ways in which the NCBI online database classifies and
organizes information on DNA sequences, evolutionary relationships, and scientific
publications.

2. Use and apply the NCBI search tool BLAST in order to identify an unknown
nucleotide sequence.

Additional Information

Bioinformatics is the marriage of molecular biology and information technology. Web sites direct you to basic bioinformatics data
and get down to specifics in helping you analyze DNA/RNA and protein sequences. All this data comes at you in several formats,
so becoming familiar with various format types helps you know how to interpret and store the data.

Where to Find Bioinformatics Data

Bioinformatics combines information technology and molecular biology, so it makes sense that the Internet is the main arena for
pursuing bioinformatics information. The following list offers links to helpful Web sites around the world and the areas that they
specialize in:

• Ensembl: The Human Genome
• GenBank/DDBJ/EMBL: Nucleotide sequence
• PubMed: Literature references
• Swiss Institute of Bioinformatics: Annotated protein sequences
• InterProScan: Protein domains
• OMIM: Genetic diseases
• GenomeNet: Metabolic pathways

Bioinformatics Web Sites for Analyzing DNA/RNA Sequences

The bioinformatics Web sites in the following list offer help in analyzing DNA and RNA sequences. And, in the marriage of
information technology and molecular biology that is bioinformatics, this type of analysis is what it’s all about.

• Webcutter: Restriction map
• GenomeScan: Gene discovery
• blastn, tblastn, blastx: Database search
• The Genome Browser: Browse the ultimate data!
• Mfold: RNA structure prediction

Bioinformatics Web Sites for Analyzing Protein Sequences

With bioinformatics you can explore molecular biology using information technology. The links to the Web sites in the following
list focus on protein sequences. Some offer searchable databases; others help you investigate a single protein; all are helpful:


• BLAST: Database homology search
• SRS: Database search
• Entrez: Database search
• InterProScan: Find protein domains
• ExPASy: Analyze a protein
• ClustalW: Multiple sequence alignment
• T-Coffee: Evaluate multiple alignment
• Jalview: Multiple alignment editor
• PSIPRED: Secondary structure prediction
• Cn3D: Display and spin 3-D structures

Format Name | Description
RAW | Sequence format that doesn’t contain any header. Spaces and numbers are usually tolerated.
FASTA | This is the default format. Sequence format that contains a header line and the sequence, for example:
>name
AGCTGTGTGGGTTGGTGGGTT
PIR | Sequence format that’s similar to FASTA but less common
MSF | Multiple sequence alignment format
TXT | Text format
GIF, JPEG, PNG, PDF | Graphic formats. Do not use them to store important information.
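
Because FASTA is simply a header line beginning with ">" followed by sequence lines, it is easy to read programmatically. The following is a minimal Python sketch of a FASTA reader; the function name `parse_fasta` is an assumption for illustration, and established libraries offer more complete parsers.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its full sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A header line starts a new record.
            name = line[1:].strip()
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # Sequence lines are collected and joined at the end.
            records[name].append(line)
    return {header: "".join(parts) for header, parts in records.items()}


example = """>name
AGCTGTGTGG
GTTGGTGGGTT
"""
print(parse_fasta(example))
```

Note that a sequence may be wrapped over several lines; the reader joins them back into one string per header.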

References

Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST. Evolution. Retrieved October 23, 2015
from http://media.collegeboard.com/digitalServices/pdf/ap/bio-manual/Bio_Lab3-ComparingDNA.pdf

Name that Gene Project. Retrieved October 23, 2015 from
https://www.tracy.k12.ca.us/sites/jhaut/Documents/Space_and_Engineering_2/projects/Name%20Gene/Name%20that%20Gene%20project%20w%20Jurassic%20park-2013.pdf

Wefler, S. H. 2003. Name that gene: An authentic classroom activity incorporating bioinformatics.
The American Biology Teacher 65(8):610–613.

EXERCISE NO. ___B

Bioinformatics: Phylogeny

INTRODUCTION

Bioinformatics is emerging as a hugely important field affecting all areas of biology. While bioinformatics is
formally the application of computer technologies to biological sciences - ranging from automated analysis of
microarrays containing thousands of individual experiments to the development of browser tools for looking at whole
genomes - students in all areas of biology need to be familiar with software tools developed by bioinformaticians in
order to accomplish routine tasks in biology (Kibak, 2004).

In this exercise, you will learn how to use a software tool (Molecular Evolutionary Genetics Analysis, or MEGA)
to analyze sequence data from different species obtained from the NCBI database. MEGA is an integrated tool for
conducting sequence alignment, inferring phylogenetic trees, estimating divergence times, mining online databases,
estimating rates of molecular evolution, inferring ancestral sequences, and testing evolutionary hypotheses. MEGA is
used by biologists in a large number of laboratories for reconstructing the evolutionary histories of species and
inferring the extent and nature of the selective forces shaping the evolution of genes and species (Tamura et al. 2013).

Learning Outcomes
At the end of this laboratory exercise, you should be able to:
1. Use the National Center for Biotechnology Information (NCBI) database to retrieve protein
and DNA sequence data.
2. Use the software tool MEGA to align sequences and analyze phylogenies.
3. Test an evolutionary hypothesis.

Activating Prior Knowledge


Reflect on the three learning outcomes above and complete the table below:

Learning Outcomes | What do you know? | Any questions/clarifications in relation to learning outcomes
1
2
3

Learning Activity 1
Read the materials below while keeping in mind the learning outcomes. While reading, be guided
by the following questions and answer them after.

1. What is Phylogeny? Phylogenetic tree? What is its importance?

2. What is bootstrapping and how do you interpret bootstrap values?

3. In what other areas of biology can bioinformatics be used?


Reading Material/Protocol

A. Creating a Phylogenetic Tree Using Nucleotide Sequences

In this exercise, you need a computer for data collection and analysis.

Research question:

Sardinella tawilis (Herre, 1927), which is endemic to Taal Lake, is the only freshwater member of its genus. Since the
lake (which fills a caldera) putatively formed only as recently as the 18th century after a series of large eruptions of Taal Volcano,
the origin of this species is enigmatic. According to Hargrove (1991), the lake was broadly connected to Balayan Bay until 1754,
when a series of violent eruptions constricted and diverted the Pansipit River (cited in Willette et al. 2014). Since tawilis is
believed to have a marine origin, which species is its most probable ancestor or closest extant relative?

To answer our research question, we will build a phylogenetic tree of relatedness between Sardinella tawilis and other
sardine species (screened by similarities in morphology) using DNA sequence data. You will first retrieve the data from the
NCBI database for Sardinella tawilis, 4 other sardine species, and 1 outgroup. The target gene is the 16S rRNA gene
(sometimes referred to as 16S rDNA; ribosomal DNA codes for rRNA, and rRNA is the RNA component of ribosomes). The MEGA
software package will then be used for sequence alignment and genetic analysis.

Note: The term rDNA may seem to imply that ribosomes contain DNA, which is not the case. In eukaryotic cells, DNA is found within the nucleus,
mitochondria, and chloroplasts, and nowhere else. Primers do not bind to the rRNA molecule, but rather to the corresponding
DNA region from which the rRNA is transcribed. So the terms “rRNA gene” and “rDNA” refer to the same thing.

1. Download the software MEGA.

2. Log on to the NCBI homepage (http://www.ncbi.nlm.nih.gov/)

A. Obtain an appropriate mitochondrial 16S sequence for the analysis.

It is impossible to provide a reasonable guide to even a small section of this tremendous resource; you will have to explore it
yourself.

For example, you can search “Sardinella tawilis 16S” and refine your search by clicking “Nucleotide” under Genomes.
Every nucleotide or protein sequence has a corresponding accession number in the database. For this exercise, type the accession
number KC951492 into the search box to get the 16S sequence data for Sardinella tawilis.

• After searching the accession number, you will be directed to a reference page that documents background
information on the origin of the sequence, principal investigators, journal references, etc.
• In the window next to the “Display” button, select “FASTA”. This will bring you to the nucleotide sequence
in FASTA format. The FASTA format is the primary format for sequence data that is recognized by bioinformatics
software.

• Copy the nucleotide sequence into a text editor such as Notepad and save it with the “.fas” file extension.
B. Return to the Nucleotide search page by clicking the back button, erase your previous search terms, and search for each of the following
accession numbers: KC951504 (S. hualiensis), KM518945 (S. gibbosa), FR849560 (S. aurita), KM518956 (S.
lemuru).

Be sure to save a copy of each sequence in FASTA file format. For the outgroup, type in “Amblygaster sirm 16S”
and choose one sequence from the list.

C. Since you need several sequences to create a good phylogenetic tree, and to save time, make use of the sequences
given below. These sequences, together with the 5 sequences you have, including 1 sequence of the outgroup, will
be prepared for sequence alignment in MEGA. Your lab supervisor will give you the softcopy of this file in FASTA
format.

D. Save all the sequences in a single FASTA file.
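
As an optional alternative to copying each record by hand, the accession numbers listed above can also be requested in FASTA format through NCBI's E-utilities `efetch` service. The Python sketch below only builds the request URL; it does not download anything. The function name `efetch_fasta_url` is an assumption for illustration, and the outgroup is omitted because the exercise leaves its accession for you to choose.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Accession numbers given in the exercise for the five Sardinella species.
ACCESSIONS = ["KC951492", "KC951504", "KM518945", "FR849560", "KM518956"]


def efetch_fasta_url(accessions, db="nuccore"):
    """Build an efetch URL that returns the given records as plain-text FASTA."""
    query = urlencode({
        "db": db,                    # nucleotide database
        "id": ",".join(accessions),  # one or more accession numbers
        "rettype": "fasta",
        "retmode": "text",
    })
    return f"{EFETCH}?{query}"


print(efetch_fasta_url(ACCESSIONS))
```

Opening the printed URL in a browser should return all five records as a single FASTA text, which matches the goal of step D. If you script many requests, keep them infrequent, since NCBI limits how often its services may be queried.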

3. Using the software package MEGA for sequence alignment.

a. Edit out all of the descriptive information except for the common name. Make sure not to remove the “>”
character, since that is how the software recognizes the beginning of the sequence.
b. Open MEGA. Follow these steps: File – Open a File/Session (find the location of your file) – then click
Align (when asked “How would you like to open this fasta file?”).
c. You will see a picture as above (2c). Find the icon “W” and click Align DNA from the drop-down menu.
Then click “Ok”, then another “Ok”. This will align your sequences.
d. You will notice that the aligned sequences are not of equal lengths. Highlight the parts you want to cut,
then press “Delete” on your keyboard. Do this on both ends of the sequences.
e. After editing the sequences, click “Data” then choose Export alignment – MEGA format (type your desired
filename) – Save – Input title of the data (type your filename again), then click “Ok”
f. You will be asked if the sequences are protein-coding, click “No”
g. Minimize the window containing the aligned sequences.
4. Using the software package MEGA to construct a phylogenetic tree.
a. To create a phylogenetic tree, click “Phylogeny” then find the location of your sequence in MEGA format (.meg)
b. From the drop-down menu, choose “Construct/Test Neighbor-Joining Tree”. With the Neighbor-joining method,
genetic distances are used to actually calculate the lengths of the tree branches. This is a very powerful approach for
building phylogenies that was made possible by the advent of molecular genetics.
c. Set the test of phylogeny to the bootstrap method, type 1000 as the number of bootstrap replications, then click “Compute”.
d. After the computation, MEGA will display your phylogenetic tree.
e. Interpret your results. (Your result and answer in letters d and e will be placed under the assessment activity)
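
The Neighbor-Joining method in step 4b starts from a table of pairwise genetic distances between the aligned sequences. The Python sketch below computes the simplest such distance, the p-distance (the proportion of compared sites at which two sequences differ), skipping any column that contains a gap. This is an illustration only: MEGA offers several substitution models and this is not necessarily the one it applies. The taxon names and toy sequences are hypothetical.

```python
from itertools import combinations


def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of compared sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # ignore columns with a gap in either sequence
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def distance_matrix(alignment):
    """Return the p-distance for every pair of names in the alignment."""
    return {
        (a, b): p_distance(alignment[a], alignment[b])
        for a, b in combinations(alignment, 2)
    }


# Hypothetical toy alignment, not real sardine data.
toy = {"taxon_A": "ACGTACGT", "taxon_B": "ACGTACGA", "taxon_C": "TCGAAC-A"}
matrix = distance_matrix(toy)
for pair, distance in matrix.items():
    print(pair, round(distance, 3))
print("closest pair:", min(matrix, key=matrix.get))
```

Pairs with the smallest distance are the ones a distance-based method will tend to join first, which is why closely related species end up as neighbors on the tree.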
B. Creating a Phylogenetic Tree Using Protein Sequences

Research question:

Trypanosomatida are unicellular eukaryotes belonging to the order Kinetoplastida. They are among the most versatile
parasites in nature, infecting mammals, fish, and plants, and are usually transmitted by insect vectors. The kinetoplast of trypanosomatids
contains many copies of the mitochondrial genome; however, present evidence suggests that trypanosomes once possessed a chloroplast
that they lost some time in their distant evolutionary past (Martin & Borst, 2003). So, are they more closely related to plants or animals?

To answer this research question, your supervisor will give you the sequence data containing the protein sequences from
Trypanosoma cruzi, 2 plants: Arabidopsis thaliana (small flowering plant) and Oryza sativa Japonica (rice), and 2 animals: Anopheles gambiae
(mosquito) QHD56795 and Hippopotamus amphibius (hippo).

The letters in the sequence data correspond to the specific amino acids that comprise the Cytochrome C protein. The following single-letter
code is used to designate the various amino acids.

a. Repeat all the procedures you have previously done in order to make a phylogenetic tree (Numbers 3 and
4 in the previous exercise).

b. Interpret your results. (Your result and answer in letters d and e will be placed under the assessment
activity).
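
The bootstrap test used when building the trees resamples the columns of the alignment with replacement, repeats the analysis on each resampled data set, and reports how often a grouping reappears. The Python sketch below illustrates that idea in a deliberately simplified form: instead of rebuilding a full Neighbor-Joining tree for every replicate, it only checks whether a chosen pair is still the closest pair by p-distance. All function names and the toy sequences are assumptions made for illustration.

```python
import random
from itertools import combinations


def p_distance(a, b):
    """Proportion of sites that differ (gaps are treated as ordinary characters)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def resample_columns(alignment, rng):
    """Draw alignment columns with replacement to build one bootstrap replicate."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def closest_pair(alignment):
    """Return the pair of names with the smallest p-distance (ties: first in sorted order)."""
    pairs = combinations(sorted(alignment), 2)
    return min(pairs, key=lambda p: p_distance(alignment[p[0]], alignment[p[1]]))


def bootstrap_support(alignment, pair, replicates=1000, seed=0):
    """Percentage of replicates in which `pair` is the closest pair."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    target = tuple(sorted(pair))
    hits = sum(
        closest_pair(resample_columns(alignment, rng)) == target
        for _ in range(replicates)
    )
    return 100.0 * hits / replicates


# Hypothetical toy protein alignment, not real cytochrome C data.
toy = {"sp1": "MGDVEKGKKI", "sp2": "MGDVEKGKKL", "sp3": "MASFAEAPAG"}
print(bootstrap_support(toy, ("sp1", "sp2"), replicates=200))
```

A value near 100 means the grouping is supported by almost every resampled data set, while a low value means the grouping depends on only a few sites.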

Reminder: Before you proceed, go back and complete Learning Activity 1 above.

Assessment Activity
This is a major assessment activity that requires you to do the following.

1. Copy and paste each of the phylogenetic trees for the nucleotide and protein sequences generated
2. Interpret the results by answering the following:
A. Creating a Phylogenetic Tree Using Nucleotide Sequences

Based on the phylogenetic tree you have made, what is the answer to your research question? Explain.

B. Creating a Phylogenetic Tree Using Protein Sequences


Based on the phylogenetic tree you have made, what is the answer to your research question? Explain.
Reflecting on your Learning

After engaging in all learning and assessment activities, reflect on your learning. Which of the three learning
outcomes have you achieved?

Learning Outcomes | What are your key learnings/highlights of your learning?

1. Know how to use the National Center for Biotechnology Information (NCBI)
database to retrieve protein and DNA sequence data.

2. Use the software tool MEGA to align sequences and analyze phylogenies.

3. Test an evolutionary hypothesis.


References

Claverie, J. M. and Notredame, C. 2007. Bioinformatics for Dummies (2nd ed). Wiley Pub. Inc., New Jersey, USA. Pp. 1-457.

Hargrove, T. R. 1991. The Mysteries of Taal: a Philippine volcano and lake, her sea life and lost towns. Bookmark Pub.

Introduction to Bioinformatics and Molecular Genetics. Retrieved October 22, 2015 from:
http://www.colby.edu/biology/bi164/Lab/Lab%2012Bioinformatics.pdf

Kibak, Henrik. 2004. An Introductory Bioinformatics Lab on Molecular Phylogenetics. Retrieved October 22, 2015 from:
http://science.csumb.edu/~hkibak/241L_web/HipposWhales.html

Martin, W., & Borst, P. 2003. Secondary loss of chloroplasts in trypanosomes. Proceedings of the National Academy
of Sciences, 100(3), 765-767.

Tamura, K., Stecher, G., Peterson, D., Filipski, A., & Kumar, S. 2013. MEGA6: molecular evolutionary genetics
analysis version 6.0. Molecular Biology and Evolution, 30(12), 2725-2729.

Willette, D. A., Carpenter, K. E., and Santos, M. D. 2014. Evolution of the freshwater sardinella, Sardinella tawilis
(Clupeiformes: Clupeidae), in Taal Lake, Philippines and identification of its marine sister-species, Sardinella
hualiensis. Bulletin of Marine Science, 90(1), 455-47
