DNA Project 2014

DNA Sequence Analysis and Poster
Manual Spring 2013
INTRODUCTION
Welcome to the Microbial Genetics sequencing project!
This assignment has several goals:
• To reinforce your knowledge of core biological concepts such as: DNA transcription
and translation;
• To introduce you to the use of computers for DNA sequence analysis;
• To strengthen your ability to perform primary literature searches of a particular subject
in microbiology & molecular biology;
• To expand your abilities to present data in a poster session format.
I hope that you will enjoy your introduction to how a microbial geneticist uses computer
technology to learn about DNA sequences and what they code for.
Work on the class project will be in pairs. You are encouraged to discuss each program and
its output with your partner. However, it is important that each of you understand how to run
each program and what the output means. Finally, you must work together to organize, write
and assemble a poster on your data that conveys the most important information about your
sequence analysis and the biological function of the product.
Tutorials designed to help you learn how to use the software will be taught during regular lab
time in the Biology Learning Center (BLC) in the Koffler building. Computer lab sessions
are designed to assist you in using the programs. The BLC is open every day and you are free
to work on your own time on your project unless another class has reserved the room. If you
have problems, please contact Dr. Baltrus (Baltrus@email.arizona.edu).
Additional notes:
• There are literally hundreds of software programs available on the web for DNA
sequence analysis and a number of software packages that can be purchased from
various companies.
• Feel free to explore the other programs. These are wonderful tools, and can perform in
minutes what would take humans a lifetime to accomplish. If you are planning a
future in a genetics or microbiology research lab, the ability to manipulate DNA
sequences and to use a genetics database search tool is a necessity.
• If you know of or discover specific programs that could be useful, or more useful, feel
free to let us know. We do not claim to know all of the programs available.
GOOD LUCK AND HAVE FUN!
Pseudomonas sp.
strain GAW0119
~6,000,000 bp
Contents
SECTION 1:
Welcome
Background on Pseudomonas sp. GAW0119
Printing Your DNA Sequence/Handling Program Outputs
SECTION 2:
Identifying Potential Coding Regions in Your Sequence
Blast analysis
Verifying ORF Using Codon Usage
Checking Third Position %GC Bias
Identifying Potential Ribosome Binding Sites
Phylogenetic Analysis of Your Gene Product
Multiple Sequence Alignment/Alternate trees
Identification of Potential Regulatory Features
Promoter Regions
Terminator Regions
Determining the Biological Function of Your Gene Product
Presentation of Your DNA Sequence and Major Sequence Features
BACKGROUND on Pseudomonas sp. strain GAW0119
This year we will be using DNA sequences obtained from a Pseudomonas species
isolated from environmental samples. This strain was originally isolated from a river in France,
and is one of the strains referenced in this paper:
http://mbio.asm.org/content/1/3/e00107-10/T1.expansion.html
This strain is interesting from our viewpoint because it serves as a phylogenetic outgroup
for phytopathogenic pseudomonads such as P. syringae and P. viridiflava, but is more closely
related to these strains than Pseudomonas fluorescens. We “think” that strain GAW0119 is
unable to cause disease in plants and therefore expect that many virulence factors are absent.
With this idea in mind, analysis of a genome sequence for this strain provides exceptional
opportunities to understand how plant pathogenicity and virulence evolve from otherwise
ubiquitous environmental bacteria.
We used short read Illumina technology to sequence the genome for this strain. The idea
behind this strategy is to generate hundreds of millions of 100 bp sequencing reads randomly
from across the genome, and use computer algorithms to piece these short fragments together in
the correct order. For this strain, we have been able to assemble millions of fragments into
approximately 500 longer contiguous regions of DNA (called contigs). Roughly half of this
genome sequence can be found in contigs of at least 126,944 bp, and so the N50 of this genome
is said to be 126,944. In some cases, these contigs may have runs of “N’s” in the middle of them.
These regions indicate that the DNA sequences on either side of the “N’s” are oriented correctly,
but there was not enough sequencing coverage to fill in the N’s. In this case, the contigs are
known as scaffolds. In contrast to complete genome sequences, in which each chromosome or
plasmid sequence resides on 1 contig, these fractured genomes are known as draft assemblies.
Even though this sequence is only at draft status, there is more than sufficient information to
begin to annotate genes and search for known proteins.
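If the N50 idea feels abstract, here is a minimal Python sketch of how it could be computed from a list of contig lengths. The function name and the example lengths are made up for illustration; they are not the real GAW0119 contigs.

```python
def n50(contig_lengths):
    """Return the length L such that contigs of length >= L hold at least
    half of all assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length          # add contigs from longest to shortest
        if running >= half_total:  # stop once half the assembly is covered
            return length
    raise ValueError("need at least one contig")


# Made-up contig lengths, for illustration only
example_lengths = [100, 80, 60, 40, 20]
print(n50(example_lengths))
```

A larger N50 generally means the assembly is in fewer, longer pieces.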
Your DNA sequence has been sent to the email address that you provided me. It has
been sent as a simple text file without any formatting.
MANIPULATING AND PRINTING YOUR DNA SEQUENCE
******Word of advice: your DNA sequence will be approximately 3-5 kb long. If you
printed this out in regular font, it would take multiple pages. Ultimately I would like to see
a printed, annotated version of this sequence on your poster in readable form yet not so
overwhelming that there is no other space on the poster. There might also be
unnecessary parts of your sequence that you can leave out of the analysis, but that’s up
to you.
******Second word of advice: it may not be feasible to work with all of this
sequence at once. It might be much easier to break it up into smaller sections and
handle each of those separately for the following analyses.
The sequence sent to you is single-stranded and is in the 5’ to 3’ direction. To
make a nice double-stranded (ds) and numbered version of your raw sequence, go to:
http://www.vivo.colostate.edu/molkit/index.html
The main page of Molecular Toolkit is shown below:
Click on “Manipulate and Display Sequences” and you should see the following page:
Copy and paste your raw DNA sequence into the box. Select “Double-stranded” and
click the “Show base numbers” option. Your sequence should now look like this below:
Change the font size to whatever you feel is best, and the font to Courier New. Save
this file on the desktop.
HANDLING PROGRAM OUTPUTS
The BLC has no capability to print output from this exercise. Therefore, all output
should be saved as a PDF file on the desktop. Macs make this easy. You can then
either:
A. log into your email account, attach the files, and send them to yourself to print
somewhere else.
B. download them onto a USB stick by dragging the filename from the desktop onto the
stick.
IDENTIFYING POTENTIAL CODING REGIONS IN YOUR SEQUENCE
The first step is to identify whether there are any N’s in your sequence. If there are N’s
that means your sequence is a scaffold rather than a contig. You can’t really do anything
with these N’s because it is unknown what sequence is supposed to be there, but
further comparison across organisms (see later in the manual) may let you take an
educated guess.
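If you would rather not hunt for N’s by eye, a short Python sketch like the one below could list every run of N’s and where it sits. The function name `find_n_runs` is my own invention, and positions are reported 1-based to match the numbered sequence you made earlier.

```python
import re


def find_n_runs(sequence):
    """Return (start, end, length) for each run of N's, 1-based and inclusive."""
    runs = []
    for match in re.finditer(r"N+", sequence.upper()):
        runs.append((match.start() + 1, match.end(), match.end() - match.start()))
    return runs


# Toy sequence, for illustration only
print(find_n_runs("ACGTNNNNACGTnnA"))
```

An empty list means your sequence is a plain contig with no gaps.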
The next step is to determine how many potential Open Reading Frames (ORFs) or
operons your DNA sequence contains. An ORF is a part of the organism’s genome
which contains a sequence of bases that could potentially encode a protein. ORFs
usually begin at an ATG (representing AUG in the mRNA) and are a contiguous
series of bases grouped in triplets that encode amino acids until a stop codon signals
translation termination.
In general, bacterial ORFs begin with an ATG (AUG) 83% of the time. However, some
bacterial ORFs begin with GTG (GUG) or TTG (UUG), ~14% and ~3% of the time, respectively.
The 3 stop codons in bacteria are TGA (UGA), TAA (UAA), and TAG (UAG).
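To make the start/stop logic concrete, here is a rough Python sketch of a naive six-frame ORF scan. This is NOT how Glimmer or ORF Finder work internally (Glimmer uses statistical models); it only walks each reading frame from the first start codon to the next in-frame stop. The function names and the `min_codons` cutoff are my own assumptions.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TGA", "TAA", "TAG"}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def scan_strand(seq, min_codons):
    """Return 0-based (start, end) pairs, end exclusive, stop codon included."""
    seq = seq.upper()
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i  # first start codon since the last stop
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    found.append((start, i + 3))
                start = None
    return found


def find_orfs(seq, min_codons=100):
    """Return (strand, start, end) in 1-based coordinates of the pasted strand.

    For '-' ORFs the start is larger than the end, as in the Glimmer output.
    """
    n = len(seq)
    orfs = [("+", s + 1, e) for s, e in scan_strand(seq, min_codons)]
    orfs += [("-", n - s, n - e + 1)
             for s, e in scan_strand(reverse_complement(seq), min_codons)]
    return orfs


# Toy sequence with one tiny ORF (ATG AAA TTT TAG), for illustration only
print(find_orfs("CCATGAAATTTTAGCC", min_codons=3))
```

ORFs that run off the end of your fragment are not reported by this sketch, which is one more reason to rely on the real programs for your analysis.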
Option 1: Glimmer
We will use the software program Glimmer (Gene Locator and Interpolated Markov
ModelER). Glimmer is a gene prediction software program useful for searching for
potential ORFs in our DNA sequences. Glimmer is very effective at finding genes in
bacteria, archaea, and viruses and typically finds 98%-99% of all protein-coding genes.
One important note: ORFs may be found on either strand of your ds DNA sequence and
may extend outside your sequence.
--Go to http://www.cbcb.umd.edu/software/glimmer/
Click on NCBI Glimmer
--Paste your ss DNA sequence into the box. Your simple text will do nicely as long as
there are no characters other than ATCG or N
--Fasta format means that the first line contains a “>” followed by a name of the
sequence. For example:
>Sequence1
ATATATATATATATATAT
--Be sure (Bacteria, Archaea) is selected and change the Topology to Linear. Press
RUN. Your output should look something like that below:
The “orfID” column lists all of the ORFs that were found. “start” is the position in the
sequence where each ORF potentially starts, and “end” is the position where it ends.
“frame” gives the open reading frame. + numbers indicate that the ORF is found from 5’
to 3’ exactly on the ssDNA piece you pasted in. - numbers indicate that the ORF is
found on the strand complementary to the piece you pasted in.
Option 2: NCBI ORF Finder
We will utilize a useful website maintained by the National Center for
Biotechnology Information. Go to http://ncbi.nlm.nih.gov
This is the intro page for NCBI:
--Click on “Tools”
--Search through list for “Open Reading Frame (ORF) Finder”
--Paste your ssDNA sequence into the box. Click “OrfFind”
--You should see a series of turquoise boxes
--You can also view a step by step tutorial here:
http://www.youtube.com/watch?v=FbhJUx7K5rE
The top three lines represent potential ORFs in the original direction of your sequences
(5’ to 3’). The lower 3 lines represent potential ORFs running in the opposite direction to
your ds DNA sequence. Remember, you are only given one strand, but the
chromosome is composed of double stranded DNA, and genes are found on both strands
in opposite directions, generally without overlapping. NOT ALL OF THESE WILL BE
ACTUAL ORFs.
--Start clicking through the potential ORFs
The rectangle is colored and potential ORFs are shown below. Note possible ATG’s are
highlighted in turquoise and the stop codon is in pink. A true ORF usually has an ATG at
the beginning (but sometimes GTG or TTG). Below the sequence are amino acids
coded for by the codon triplets in the ORF.
--Remember that bacterial genes are ~1 kb on average (doesn’t mean they
can’t be bigger or smaller) when considering which of these is a true ORF. However,
there should only be one ORF within a given section of DNA as bacterial ORFs
generally don’t overlap.
--Repeat this for several possible ORFs. Save your output, as it will be useful later when
analyzing these sequences.
--At this point you can also use the ‘blastp’ program to check and see if the translation
of this ORF is annotated as a protein sequence in other organisms, or find if it has a
gene name or protein name in other organisms. See
http://www.youtube.com/watch?v=HXEpBnUbAMo for a nice blast tutorial.
blastp searches through all annotated protein sequences at the NCBI database and
tries to match your sequence with those.
You are given a variety of options to modify for the search, or you can click “View
report” to perform the blast search. I suggest you limit results to 50.
At the top of the output screen will be protein motifs that are recognized within this
potential ORF. This will be followed by many differently colored (or just red) bars that
indicate sequence similarity of your query ORF with proteins in the database. Black
means there is no similarity, red means very high similarity. If you hover the mouse
arrow above these bars it will give you descriptions in the box above. You can also see
a numbered bar above these lines, which represents sequence positions within your
queried ORF. Clicking on those colored lines will bring up protein alignments for your
ORF and those identified by this search. If you click on a bar, it takes you to the
alignments below. HOWEVER, if you just scroll down you will see a list like this:
There is a lot of information provided here.
gb: genBank database hit (the protein ID of the matched protein)
AAK73190.1 (or something similar) = the accession number for the matched protein.
***If you click on the ID, it will bring you to another page with a lot of potential
information such as gene names. If a genome is annotated well (P. syringae pv. tomato
DC3000 is annotated extremely well) this page can give you a lot of useful information.
A brief description of the protein
Score (Bits) = in general, the higher the Bit score the better the alignment.
E Value= Expectation value, the number of different alignments with scores equivalent
to or better than each alignment that are expected to occur in a database search by
chance. The lower the E value, the less likely it is that the match arose by chance, and so
the higher the chance that it is a real match. An E value close to 0 means a match this
good is very unlikely to be a coincidence.
G in box: takes you directly to the Entrez gene site that provides a lot of information.
This might not be there.
****For later***** You can choose which sequences to use in a phylogenetic analysis by
clicking the boxes next to each blast hit
Next scroll down further to the alignments:
This alignment contains a lot of data. On the top is the identifier line of the protein match
of your ORF, associated with the colored line you just clicked. Below this identifier line is
the sequence ID in the NCBI database for this matched protein as well as its total
length. Note which strain and species the matched protein is found in. Next is a line that
shows information about the match itself. Specific numbers of note are:
• Identity: the number of amino acids that match exactly between your query ORF
and this particular matched protein.
• Positives: the number of amino acids that either match exactly between the query
ORF and the particular matched protein, or are inexact matches that maintain the
basic characteristics of the amino acid.
• Gaps: the number of insertions or deletions between your queried ORF and the
matched protein.
Next you will see an actual alignment between the protein sequence that your ORF
query codes for and that of the matched protein. The “Query” line is your ORF and the
“Sbjct” line is the matched protein. In between these two lines are letters where there
are exact matches between your queried ORF and the matched protein, or + signs
where the chemical characteristics of the amino acids are similar even though there is no
exact match.
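To see where the Identity and Gaps numbers come from, here is a small Python sketch that counts them from a pair of aligned “Query” and “Sbjct” strings. It assumes gaps are written as “-” and it does not compute Positives, since that would need a substitution matrix (blast uses one, but I leave it out here). The function name and toy sequences are hypothetical.

```python
def alignment_stats(query, subject):
    """Count identities and gap positions in two aligned sequences."""
    if len(query) != len(subject) or not query:
        raise ValueError("aligned sequences must be non-empty and equal in length")
    pairs = list(zip(query.upper(), subject.upper()))
    identities = sum(q == s and q != "-" for q, s in pairs)  # exact matches
    gaps = sum(q == "-" or s == "-" for q, s in pairs)       # insertions/deletions
    return {
        "length": len(pairs),
        "identities": identities,
        "gaps": gaps,
        "percent_identity": 100 * identities / len(pairs),
    }


# Toy alignment, for illustration only
print(alignment_stats("MKT-LV", "MKSALV"))
```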
You will also see positions within your ORF and the matched protein that line up. Ideally,
the protein match will start at Query 1 and Sbjct 1 and align perfectly over the whole
sequence of the matched ORF. If the first number in the query alignment is larger than
in the matched protein alignment, it means that your called ORF has a longer N
terminus than that other annotated protein. This could be because it truly is a longer
protein or because your start codon is incorrectly called. If the beginning of your ORF query is a
small number, but the matched protein is high, it could mean that the true start codon
lies outside of the region of DNA you were given, or that this ORF is truncated. In the
example above, the queried alignment starts at 24, whereas the match starts at position
7, so the called ORF is somewhat longer at its N terminus than the matched protein, but
these proteins are otherwise aligned fairly closely.
--When you believe you have found a “true” ORF that codes for a protein, select
“Accept”.
--Note that you can check for alternate start codons.
Click on the down arrow next to “View”. Choose “Fasta nucleotide” and click “View”.
As I described above, FASTA format includes a definition line first (denoted by “>”
followed by a description). Each line after that is 80bp of DNA sequence for that ORF.
This is simply a text-based format for DNA or protein sequences. The > allows identifier
information to be included but this data is not analyzed when using various programs.
--Save your ORF DNA sequence in FASTA format.
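If you end up building a FASTA file by hand, a Python sketch along these lines could do the formatting for you. The function name `to_fasta` is hypothetical; the 80-character line width follows the description above, and the check for characters other than A, C, G, T or N mirrors what the ORF-finding programs expect.

```python
def to_fasta(name, sequence, width=80):
    """Return a FASTA record: a '>' definition line, then wrapped sequence lines."""
    cleaned = "".join(sequence.split()).upper()  # drop spaces and line breaks
    unexpected = set(cleaned) - set("ACGTN")
    if unexpected:
        raise ValueError("unexpected characters: " + "".join(sorted(unexpected)))
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Toy 100 bp sequence, for illustration only
print(to_fasta("orf1", "acgt" * 25))
```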
Alternatively, you can copy and paste the exact bases of your suspected ORF into a MS
word file. You may have to change the font and size. Remove the base numbering. To
do this, you can go to: www.vivo.colostate.edu/molkit/index.html and paste the sequence
in and click double stranded. You see that the top box shows your sequence without
numbers or amino acid symbols.
Since your DNA fragment is 10kb, it is very likely that there will be multiple ORFs.
Perform these analyses to find as many believable ORFs as possible in your sequence.
Are the ORFs part of an operon? Think of what it means to be in an operon in terms of
spacing between the ORFs, which strand of DNA the ORFs are on, etc...
CHECKING CODON USAGE IN EACH ORF
After identifying your ORFs, another useful tool can be found at:
http://www.bioinformatics.org/sms2/codon_usage.html
This website allows you to compare codon usage patterns across all of your ORFs.
Click on Codon Usage on the left
Paste the ssDNA sequence of the ORF you want to test into the box. Change the
genetic code to be used by clicking the down arrow. Select bacterial (11).
Press submit and you will see an output page
The genome of GAW0119 isn’t publicly available yet, but I’ve already told you that this
strain is phylogenetically related to both P. syringae and P. fluorescens. You can Google
“P. syringae codon usage table” and find codon usage tables to compare your ORFs to
in related organisms. Is codon usage pattern similar between your ORF and the
average for the genomes of these organisms? Why might they be different?
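For a sense of what a codon usage tool tallies, here is a minimal Python sketch that counts codons in an ORF and converts the counts to frequencies per thousand codons, a unit that published codon usage tables commonly use. The function names are my own, and the sketch assumes the sequence you pass in starts at the first base of the start codon.

```python
from collections import Counter


def codon_counts(orf):
    """Count each in-frame codon, ignoring any incomplete codon at the end."""
    seq = "".join(orf.split()).upper()
    usable = len(seq) - len(seq) % 3
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))


def codon_frequencies_per_thousand(orf):
    """Convert codon counts to frequency per 1000 codons."""
    counts = codon_counts(orf)
    total = sum(counts.values())
    return {codon: 1000 * n / total for codon, n in counts.items()}


# Toy ORF (ATG GCC GCC TAA), for illustration only
print(codon_frequencies_per_thousand("ATGGCCGCCTAA"))
```

Keep in mind that a single short ORF gives noisy frequencies, so do not over-interpret small differences from the genome-wide tables.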
CHECKING THIRD POSITION %GC BIAS IN YOUR ORF
There are some additional approaches that are useful for describing your ORFs. The
genome of GAW0119 is GC rich (~60%). One consequence of this is that the third
position of each codon is often a G or a C. You can test this by looking at bias in the
third base, aka skew.
--Go to the EMBOSS Wobble website at:
http://bioweb.pasteur.fr/seqanal/interfaces/wobble.html
Paste in your ‘FASTA formatted’ DNA sequence that you saved earlier during the ORF
finder section. ‘Run’ wobble. Put in your email address. The results will be emailed to you.
Click ‘wobble.1.png’ to see a graph of the 3rd position GC bias of your ORF.
If you have found a real ORF, the third position will be G or C approximately 60% of the
time. If the GC% is much lower or higher, it could indicate that this region was introduced
into the bacterium fairly recently via horizontal gene transfer (HGT).
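The wobble program draws a graph, but the single overall number is easy to compute yourself. Below is a minimal Python sketch that gives the fraction of third codon positions that are G or C; the function name is hypothetical, and it assumes the sequence begins at the first base of the start codon so that every third base really is a third position.

```python
def gc3_fraction(orf):
    """Fraction of third codon positions that are G or C."""
    seq = "".join(orf.split()).upper()
    usable = len(seq) - len(seq) % 3
    third_positions = seq[2:usable:3]  # bases 3, 6, 9, ... of the ORF
    if not third_positions:
        raise ValueError("ORF must contain at least one full codon")
    return sum(base in "GC" for base in third_positions) / len(third_positions)


# Toy ORF (ATG GCC GCG TAA), for illustration only
print(gc3_fraction("ATGGCCGCGTAA"))
```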
IDENTIFYING POTENTIAL RIBOSOMAL BINDING SITES
The AUG start codon is one part of the Translation Initiation Region (TIR). The second part
is the Ribosome binding site (RBS). This is critical as it tells the ribosome which ATG start
codon to use versus just a methionine codon within an ORF. Computer programs are poor
at identifying RBS due to their variable distance (2-10nt) upstream of the start codon and
the variability in their sequences.
Visually scan your sequence upstream of the start codon. Do you see a purine rich set of
bases spaced appropriately? Assume that this is your RBS and indicate this sequence in
the sequence you include on your final poster.
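You should still look by eye, but a crude Python sketch like this one can point you at purine-rich stretches in the right neighborhood. The 2-10 nt spacing comes from the text above; the 6-base window and the cutoff of 5 purines are my own assumptions, not an established rule, and the function name is made up.

```python
def purine_rich_windows(seq, start_index, window=6, min_gap=2, max_gap=10,
                        min_purines=5):
    """List purine-rich windows upstream of a start codon.

    start_index is the 0-based index of the first base of the start codon.
    Each hit is (1-based window start, window sequence, gap, purine count),
    where gap is the number of bases between the window and the start codon.
    """
    seq = seq.upper()
    hits = []
    for gap in range(min_gap, max_gap + 1):
        win_end = start_index - gap
        win_start = win_end - window
        if win_start < 0:
            break  # ran off the 5' end of the sequence
        candidate = seq[win_start:win_end]
        purines = sum(base in "AG" for base in candidate)
        if purines >= min_purines:
            hits.append((win_start + 1, candidate, gap, purines))
    return hits


# Toy sequence with AGGAGG placed 5 bases upstream of ATG, for illustration only
toy = "TTTTAGGAGGTTTTCATGAAA"
for hit in purine_rich_windows(toy, toy.find("ATG")):
    print(hit)
```

For an ORF on the opposite strand, you would need to reverse complement your sequence first so that “upstream” points the right way.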
PHYLOGENETIC ANALYSIS OF EACH GENE PRODUCT
Above I mentioned that you can click the blast hits for use in phylogenetic analysis. For
each ORF of interest, return to the blast output screen:
At the top is a link that says “Distance tree of results”. Scroll down through the blast
matches and pick ~20 matches that encompass a range of relatedness, from closely related
to distantly related. If this isn’t possible (if they are all close) expand your blast search to
more than 50. Select these sequences by clicking the gray boxes next to each hit. When
you click the boxes and select “Distance tree of results” it will give you a phylogenetic tree
of the hits you selected. Click on the Tree Method and change to “Neighbor Joining”. Also
add the name of your protein under “Sequence title”. Your output should look similar to the
one shown below:
This phylogenetic tree represents the evolutionary relationships among your protein and
those closely related to it as determined by your blast analysis. It is based on the field of
cladistics that analyzes evolutionary relationships between groups in order to construct
trees.
The neighbor-joining (NJ) method is commonly used for DNA or protein sequences
because it constructs relationships among objects without all objects having diverged by
equal amounts (i.e. no molecular clock). This approach produces unrooted trees which
examine relationships among aa sequences without assuming anything about common
ancestry.
A full understanding of what a phylogenetic tree represents is beyond this course. However,
http://evolution.berkeley.edu/evolibrary/article/phylogenetics_02 provides a good overview
of how to read phylogenetic trees. For example, it defines terms you hear quite often such
as “parsimony”. The parsimony principle is that the simplest explanation that fits the data
is usually the best. Sort of like Occam’s razor.
However, for our purposes we can make several determinations.
1. Determine the identity of your ORFs of interest (e.g., mexR in P. aeruginosa)
2. Determine the most closely related proteins to the ORFs of interest and which species they
are found in. I already told you that GAW0119 is phylogenetically nestled in between P.
fluorescens and P. syringae. If the closest related gene sequence is from a species unlike
these two, that means your region was likely acquired through horizontal transfer.
Note that the major groups of bacteria, such as Enterobacteriaceae and beta- and
gamma-proteobacteria, are indicated by colored balls.
MULTIPLE SEQUENCE ALIGNMENT
Sometimes it is useful to align related sequences from a number of organisms to see where
in their sequence they are similar and where they have the most divergence. This can be
extremely useful when trying to identify conserved domains or motifs that may carry out
conserved functions such as hydrolyzing ATP for energy (ATPase), or other activities.
Go to www.microbesonline.org
First you need to register for the site. Click ‘Register’ at the top of the page. Fill in the
required details. Now you can start your analysis.
There are 3 boxes across the top of the page (Add Genomes, Genomes Selected, and
Search Genes in Selected Genomes).
In the ‘Add Genomes’, type “Pseudomonas” in the box.
Select ‘Pseudomonas (23 genomes)’, click ‘Add’. They will appear in the ‘Genomes
Selected’ box.
Type in the name of your gene in the search box. You can find this name by sorting back
through the blast hits. You can also use search terms identified in your blast search to query
a variety of genomes. I suggest going back to the blast result, finding blast matches in
genomes that are represented in microbesonline.org, and using those blast results for
analyses (like P. syringae DC3000). This will make your life a lot easier because some genomes
are annotated much better than others.
Click “Find Genes”
Add to gene cart
Find genes of interest and add them to your Gene Cart for future analyses.
If you can’t find names or descriptors for your ORF of interest, you can also use a
“Sequence Search” link at the top of the page to enter in your sequence and search this
website to find matches.
Once you’ve added multiple genes to the Gene cart, there are a variety of tools on this site
that you can explore and potentially use for your project.
Click “Save this temporary gene cart”
See a new window, Click “View cart in genome browser”.
This page shows you the organization of the genes surrounding your ORF of interest in all
of the strains listed (which were put there by placing their genes in your gene cart). Place
your cursor over genes of interest and click, it will take you to a page describing that gene in
that bacterium.
***This is a very useful figure for determining whether your genes are in operons and how
much the gene order changes for your ORFs of interest across different bacteria. If you
have identified multiple ORFs in your sequence, do other genomes look like they have
these ORFs in the same orientation? Based on these regions in other genomes, are there
other ORFs within your sequence that you may have missed? You can learn a lot by
comparing these regions in different bacteria. If your region has N’s in it...comparison
across genomes can tell you how much DNA should replace those N’s and what types of
genes should be there.
Go back to the cart page by clicking the Back button on your browser.
Click ‘Create a Multiple Sequence Alignment’. Hit ‘Submit job’.
You will get a screen saying that your job has been submitted and is currently running. After
~1-2 minutes you will see another screen:
Click ‘JalView’. Note that the amino acids are colored to help visualize the similarities in
different regions of the compared protein sequences. Note regions more highly conserved
than others. Could this provide insights into differences in evolutionary selection between
regions as far as function is concerned?
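If you want a number to go with the colors, here is a rough Python sketch that scores each column of an alignment by how often the most common amino acid appears. This is just one simple way to measure conservation and is not what JalView itself calculates; the function name and the toy alignment are made up, and gaps are assumed to be written as “-”.

```python
from collections import Counter


def column_conservation(aligned):
    """For each alignment column, return the fraction of sequences that carry
    the most common residue (gap characters never count as the residue)."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("need at least one sequence, all of the same length")
    scores = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)  # column made only of gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


# Toy alignment of three short peptides, for illustration only
print(column_conservation(["MKTA", "MKSA", "MR-A"]))
```

Columns that score close to 1.0 across distantly related bacteria are good candidates for functionally important positions.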
Go back to the previous screen. Scroll down the page, click ‘Submit job’. Your job is now
running. You will see a new window when this is done in ~1-2 minutes.
You can save your results by clicking on ‘Save My results’. How does this tree compare to
the first one you did?
Click on ‘View this tree with gene neighborhood context’ (It’s near the bottom of the
page). Does this change your interpretation of relatedness?
IDENTIFICATION OF POTENTIAL REGULATORY FEATURES
Promoter regions:
Since a promoter region that initiates transcription occurs upstream of the coding region of
ORFs and operons, you will need to retrieve your original sequence and use the region
upstream of start codons of interest. As we will discuss in class, promoters are defined
simply based on their function as a sequence recognized by RNA polymerase rather than
their sequence. However, in many cases knowing where a promoter is located is important
for biological research. THERE IS A LOT OF CHAFF TO SORT THROUGH WITH THIS
AND YOU MIGHT NOT GET MUCH OUT. KEEP IN MIND WHAT EACH PROMOTER
THAT THESE PROGRAMS GIVE YOU ACTUALLY REGULATES AND ASK YOURSELF
WHETHER THIS MAKES SENSE WITH WHAT YOU’VE READ OR THINK.
What follows are a series of websites that TRY to find promoter regions. Try each of them
and compare the outputs. Identification of the same region by multiple programs may
indicate that you have identified a real promoter.
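Before pasting anything into these sites you need the upstream sequence itself. A minimal Python sketch for pulling it out is below. The 200 bp default is an arbitrary assumption on my part, the function names are hypothetical, and coordinates are 1-based positions on the strand you were sent (for a “-” strand ORF, the start is the higher-numbered end, as in the Glimmer output).

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def upstream_region(seq, start, strand, length=200):
    """Return up to `length` bases upstream of a start codon, written 5' to 3'
    on the gene's own strand."""
    seq = seq.upper()
    if strand == "+":
        begin = max(0, start - 1 - length)
        return seq[begin:start - 1]
    if strand == "-":
        # Upstream of a minus-strand gene lies to the right on the pasted strand
        return reverse_complement(seq[start:start + length])
    raise ValueError("strand must be '+' or '-'")


# Toy examples, for illustration only
print(upstream_region("AAACCCATGGGGTTT", 7, "+", length=3))
print(upstream_region("GGCTAAAATTTCATGG", 14, "-", length=2))
```

Add a “>” definition line on top of the result and you have the FASTA input these promoter programs ask for.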
PPP: http://bioinformatics.biol.rug.nl/websoftware/ppp/ppp_start.php
Paste in your upstream sequences IN FASTA FORMAT, and search. Try changing E values.
Another site is by Softberry at:
http://linux1.softberry.com/berry.phtml?topic=index&group=programs&subgroup=promoter
Check for the “Search for Motifs” tab on the left side
Select “BPROM” (for Bacterial promoters). Follow the instructions (and look at the Help
page for descriptions).
Virtual Footprint:
http://www.prodoric.de/vfp/index.php.
Click on “Promoter Analysis”
Step 1: Paste in your sequence
Step 2: Select “All”
Press “Start”
Examine the Output.
This program also has a site for ‘Regulon Analysis’ that might be interesting to try.
Terminator Regions:
Once transcription has been initiated it is equally important that RNA polymerase know
when to stop transcription and release the nascent RNA chain. Bacterial transcription
terminators can be divided into two types, factor independent and factor dependent. The
best characterized terminators from a sequence standpoint are factor-independent
terminators. These sequences have a characteristic inverted repeat sequence (usually GC
rich) followed by a run of A’s or U’s (3-10 bases).
The inverted repeat sequence can form structures called stem loops that result in RNA
polymerase pausing. The presence of a run of U’s or A’s destabilizes the transcription
bubble resulting in dislodging of RNA polymerase and RNA strand release. Thus, most
terminator programs try to identify sequences that can form potential stem loops.
We will try to identify transcription terminators in your DNA sequence a couple of ways.
First, you want to make a file that contains the end of your ORFs and downstream DNA
sequences. This is important as inverted repeat sequences can occur anywhere in the
sequence but can only be biologically relevant as terminator sites at the end of a gene.
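To get a feel for what terminator programs look for, here is a crude Python sketch that searches for a perfect inverted repeat (a potential stem) followed closely by a run of T’s (U’s in the RNA). Real programs such as FindTerm score imperfect stems and folding energy, which this does not. The stem length, loop sizes, and T-run cutoff are my own assumptions, the function names are made up, and the sketch only reads the strand you give it, so reverse complement first for genes on the other strand.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_hairpin_terminators(seq, stem=6, min_loop=3, max_loop=8,
                             min_t_run=3, tail_window=10):
    """Return (1-based start, 1-based end of stem-loop, left stem, loop size)
    for perfect inverted repeats followed by a run of T's."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2 * stem - min_loop + 1):
        left = seq[i:i + stem]
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop  # where the right half of the stem would begin
            right = seq[j:j + stem]
            if len(right) < stem:
                break
            if right != reverse_complement(left):
                continue
            tail = seq[j + stem:j + stem + tail_window]
            if "T" * min_t_run in tail:
                hits.append((i + 1, j + stem, left, loop))
    return hits


# Toy sequence: GCCGCC stem, 4-base loop, GGCGGC, then T's (illustration only)
toy = "AAAGCCGCCGAAAGGCGGCTTTTTTTAAA"
for hit in find_hairpin_terminators(toy):
    print(hit)
```

Expect overlapping hits around the same hairpin, since shifting the stem by a base or two often still gives a valid repeat.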
Go back to the Softberry website at:
http://linux1.softberry.com/berry.phtml?topic=index&group=programs&subgroup=gfindb
Click on “FindTerm”
Paste the end nucleotide sequence in the box. Click ‘Run Java viewer’ and press
‘Process’. It may take a while for your results to be available.
DETERMINING THE BIOLOGICAL FUNCTION OF YOUR GENE PRODUCT
So now you have:
-Identified potential ORFs
-Determined what the protein products are and what genes code for them
-Designed a publication quality printout of your ds sequence, including amino acid
sequences for identified ORFs in alignment form, and with the locations of specific
features such as promoters and terminators highlighted.
The last part of your exercise is to understand the biological function of the product encoded
by your gene.
In bacteria, genes which have similar functions or which are involved in similar pathways are
often clustered on the chromosome. So let’s find out what other genes may be closely linked to
yours.
Go to the Pseudomonas Genome Database (this site will be a goldmine for you if you
let it) website at:
http://www.pseudomonas.com/
Click on ‘Database Search’. (But also note the gray box on the right side)
If you know your gene name you can enter it in here and search.
If you find a match, click on genes of interest in different organisms.
You will see the gene you clicked highlighted in the middle of the screen along with its ORF
number (locus ID). The locus ID changes for each different genome sequence. This also
tells you the exact base coordinates of this gene in the sequenced genome and the name of
the product of the gene.
This is the page for your gene as determined by its Locus ID from another genome. This
page has much more information than I will describe here. Try different links for more
information. Notice what other genes are located next to your ORFs of interest. Do they
match other ORFs you’ve identified from your region?
As you explore this website, you will likely find out a variety of information about your gene
including:
-putative orthologs in other bacteria
-COGs: clusters of orthologous groups
-PFAM motif predictions
-Subcellular localization
-KEGG metabolic pathway information
-Locations of putative terminators
-Whether transposon mutants are available for this gene
-%GC curve and 3 frame translation data
A brief description:
COG: A protein or group of proteins (typically orthologs or paralogs) that come from a
minimum of 3 lineages which will ultimately correspond to the ancestral domain.
PFAM: a large collection of multiple sequence alignments and hidden Markov models
covering many protein families.
KEGG: Kyoto Encyclopedia of Genes and Genomes. KEGG is a database of biological
systems, consisting of building blocks of genes and proteins (KEGG Genes), chemical
building blocks of both endogenous and exogenous substances (KEGG Ligand), molecular
wiring diagrams of interaction and reaction networks (KEGG pathway), and hierarchies and
relationships of various biological objects (KEGG Brite). KEGG provides a reference
knowledge base for linking genomes to biological systems and also to environments by the
presence of PATHWAY mapping and BRITE mapping.
****Once you’ve got a pretty good idea of what your genes of interest do, go back to the
NCBI homepage at http://www.ncbi.nlm.nih.gov/. Type the gene name in the box and search
“All databases”. You should get a page like this:
Start with PubMed and see if you can find a couple of good articles that tell you more about
the functions of your DNA region.
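If you would rather script the literature search than click through the website, the Python sketch below builds a PubMed search URL for NCBI's E-utilities ESearch service. Treat it as a rough starting point: the function name and its parameters are made up for this example, and the gene and organism shown are only an illustration. It builds the URL but does not contact NCBI, so you can paste the result into a browser.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(gene, organism="", max_results=20):
    """Build an ESearch URL that looks up a gene name (and organism) in PubMed.

    This helper is hypothetical; it only assembles the query string.
    """
    term = f"{gene} AND {organism}" if organism else gene
    params = {
        "db": "pubmed",         # search the PubMed database
        "term": term,           # the search text
        "retmax": max_results,  # maximum number of article IDs to return
        "retmode": "json",      # ask for a JSON reply
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    # Illustrative query only; substitute your own gene name.
    print(pubmed_search_url("mexR", "Pseudomonas", max_results=5))
```

The reply lists PubMed IDs, which you can then look up on the PubMed website to read the abstracts.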
PRESENTATION OF YOUR POSTER PROJECT
As I mentioned above, your sequence will be ~10 kb long and will take up a lot of space if
you print it all out on the poster.
As such, what I expect in the final poster is a mock-up of the sequence showing all ORFs,
promoters, terminators, RBS, start and stop codons. Where the sequence itself is needed to
understand a feature (e.g., RBS sites), please include the sequence. Show me which strand
each gene is on, and in which direction transcription takes place. If there is an operon in
your region, show me this.
This is the base of what you can do. Please be creative and demonstrate to me that you’ve
thoroughly analyzed your region of interest. This should be Fig. 1.
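Before drawing Fig. 1, it can help to rough out where each ORF sits and which way it points. The Python sketch below prints a scaled plain-text map from a list of ORF coordinates. It is only a planning aid, not a required format: the function name, the tuple layout and the example ORFs are all hypothetical, and you would replace them with the coordinates from your own analysis.

```python
def orf_map(seq_length, orfs, width=60):
    """Return a plain-text map with one line per ORF.

    orfs is a list of (name, start, end, strand) tuples using 1-based
    coordinates and strand '+' or '-'. The arrowhead shows the direction
    of transcription. All names here are illustrative.
    """
    lines = []
    for name, start, end, strand in sorted(orfs, key=lambda orf: orf[1]):
        # Scale base coordinates onto a line that is `width` characters long.
        left = (start - 1) * width // seq_length
        right = max(left + 1, end * width // seq_length)
        body = "=" * (right - left - 1)
        arrow = body + ">" if strand == "+" else "<" + body
        lines.append(" " * left + arrow + f"  {name} ({start}-{end}, {strand})")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical ORFs in a 10 kb region; not real data.
    example_orfs = [
        ("orfA", 250, 1400, "+"),
        ("orfB", 1600, 4100, "-"),
        ("orfC", 4300, 9200, "+"),
    ]
    print(orf_map(10000, example_orfs))
```

Promoters, terminators and RBS sites could be added to the same list as short features once you have located them.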
For each of the true ORFs within your sequence (if there are more than three, pick the
three largest), show me how these sequences compare to those of other pseudomonads using
the tools I’ve shown you here or any other tool you’d like to use.
Were these ORFs recently acquired by GAW0119? How do you know? Is the orientation of
this region conserved across pseudomonads? How do you know? I’ve provided you
with many tools to start exploring these sequences.
The final poster product should look something like the one shown below (this is an
example; there can be many variations on this theme). There should be a descriptive title,
underneath which should be the names of both partners.
There needs to be an introduction section where you describe the overall goals of this
project and the organism of interest. The introduction should be followed by multiple figures
relaying information about your region of interest. I’ve already laid out what I’d like for figure
1.
***It is important that your figure legends give enough information for a non-specialist to
determine what is going on in the figure. Make them descriptive enough so I can tell that
you know what you are doing and not just copying words.
In all cases, there will be a fairly well characterized gene within your region. One good thing
to do might be to research what is known about the functions of this gene within other
bacteria. Is there something special about this gene within pseudomonads that is different
from other species? How is this gene regulated? Are there known sites of activity within this
gene? (If so, it might be good to highlight the important amino acids in your schematic.) I
can’t tell you in this manual what specific things to look for in each region, but the posters
with the highest scores will include a lot of value-added information. What genes flank
your main gene in other pseudomonads, and are these the same in your region? Are
there any transposon mutants available in other pseudomonad species? And so on.
Making your poster with scissors and paper vs. printing: Printed posters will not
automatically get higher grades, but designing your poster on a computer and making it look
clean is much easier than doing the same with scissors and printouts. Basically the way I
look at this is that the money spent towards printing your poster equals time that you don’t
have to spend making figures and the presentation look good. I’ve seen great posters with
hand cut designs and printouts and I’ve seen crappy posters that have been printed out.
Please email me either a picture of your poster or a PDF file of your poster when it is
completed, in addition to presenting it during the symposium.
Be creative and show me through the poster that you’ve put work into exploring the
functions of your region of DNA. Below is a sample schematic, but by no means use this as
an exact poster template.