3) 13 Homology and DNA Sequence 2001 The Character Concept in Evolutionary B
WARD WHEELER
Department of Invertebrate Zoology, American Museum of Natural History, New York, NY 10024
INTRODUCTION
The phylogenetic analysis of DNA sequences, like that of all other comparative
data, is based on schemes of putative homology which are then tested via
congruence to determine synapomorphy schemes and cladistic relationships.
Unlike some other data types, however, the matrix of putative homologies or
"characters" is not directly observable. When sequences are unequal in length,
the correspondences among sequence positions are not preestablished and some
sort of procedure is required to determine which positions are "homologous."
This is the traditional province of multiple sequence alignment (= alignment
here). Alignment generates a collection of column vectors through the insertion
of gaps, which form the character set. Whether accomplished manually, or via
some computational algorithm, these characters are then submitted to
phylogenetic analysis in the same manner as other forms of data. This scheme of
correspondences or putative homologies has two salient features. First,
alignment precedes the phylogenetic analysis (i.e., cladogram search) and is
The Character Concept in Evolutionary Biology
Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved.
                          Topology
Alignment     (((I II) III) IV)   (((I III) II) IV)   (((I IV) II) III)
I    GGGG             7                   6                   8
II   -GGG
III  GAAG
IV   -GAA
Insertion-deletion events cost 2
Base changes cost 1
FIGURE 1 Possible alignment for four simple sequences and the cladogram cost (length) for the
possible topologies for these taxa.
The point of this example is that the alignment process yields a static
homology scheme which is not optimized for any particular topology. Once the
alignment is determined, all testing of the alignment itself stops. Although
homologies are tested on each cladogram, there may be no single homology
matrix which optimizes Homology (yields the most parsimonious result) for each
cladogram. In order to give each topology its shortest length, homologies need
to be generated which are optimal for that particular topology. It is this need
which motivates the method of "optimization-alignment" (Wheeler, 1996).
                          Topology
Alignment     (((I II) III) IV)   (((I III) II) IV)   (((I IV) II) III)
I    GGGG             7                   6                   8
II   -GGG
III  GAAG
IV   -GAA

I    GGGG             6                   6                   8
II   GGG-
III  GAAG
IV   GAA-
FIGURE 2 Comparison of the implications of two different alignments on the cladogram costs for
the sequences in Fig. 1.
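The costs in Figs. 1 and 2 can be checked with a short sketch. It assumes that sequences II and IV each carry a single gap (so every row has four columns), that gaps are treated as a fifth state, and that each column is optimized independently with a Sankoff-style down-pass. The function and variable names are illustrative and are not taken from MALIGN, PHAST, or POY.

```python
INDEL_COST = 2  # insertion-deletion cost, as in the Fig. 1 legend
SUBST_COST = 1  # base-change cost, as in the Fig. 1 legend
STATES = "ACGT-"


def base_cost(a, b):
    """Cost of transforming state a into state b within one aligned column."""
    if a == b:
        return 0
    return INDEL_COST if "-" in (a, b) else SUBST_COST


def down_pass(tree, column):
    """Return {state: minimum subtree cost if this node is assigned that state}."""
    if isinstance(tree, str):  # terminal taxon: only its observed state is allowed
        return {s: 0 if s == column[tree] else float("inf") for s in STATES}
    left = down_pass(tree[0], column)
    right = down_pass(tree[1], column)
    return {
        s: min(left[t] + base_cost(s, t) for t in STATES)
        + min(right[t] + base_cost(s, t) for t in STATES)
        for s in STATES
    }


def cladogram_length(tree, alignment):
    """Sum the minimum column costs over all aligned positions."""
    width = len(next(iter(alignment.values())))
    total = 0
    for i in range(width):
        column = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += min(down_pass(tree, column).values())
    return total


# The three topologies of Fig. 1, written as nested pairs.
TOPOLOGIES = [
    ((("I", "II"), "III"), "IV"),
    ((("I", "III"), "II"), "IV"),
    ((("I", "IV"), "II"), "III"),
]

# The two alignments compared in Fig. 2 (single-gap reading is an assumption).
ALIGNMENT_1 = {"I": "GGGG", "II": "-GGG", "III": "GAAG", "IV": "-GAA"}
ALIGNMENT_2 = {"I": "GGGG", "II": "GGG-", "III": "GAAG", "IV": "GAA-"}

for name, alignment in (("first", ALIGNMENT_1), ("second", ALIGNMENT_2)):
    lengths = [cladogram_length(tree, alignment) for tree in TOPOLOGIES]
    print(name, "alignment:", lengths)
```

Under these assumptions the first alignment gives lengths 7, 6, and 8 and the second gives 6, 6, and 8, matching the figures: neither fixed alignment is simultaneously best for every topology.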
FIGURE 3 An example of HTU sequence optimization at an internal node via the
optimization-alignment (Wheeler, 1996) procedure. Columns: Hypothetical Ancestral Sequence; Cost.
13. HOMOLOGYAND DNA SEQUENCE DATA 307
Topology (((I II) III) IV) (((I III) II) IV) (((I IV) II) III)
FIGURE 4 Optimization results for the three topologies and sequences of Fig. 1. Note the possible
ambiguities in HTU sequence determination.
Searching for Homology can directly test this dynamic optimization scheme.
Since the method attempts to more efficiently derive homologies, and
Homologies, it can be tested by parsimony. Cladograms should be more
parsimonious and homoplasy less prominent with dynamic homology. Although
this will have to be tested by multiple data sets, the cases presented by Wheeler
and Hayashi (1998) on chelicerates and the examples in Wheeler (2000) and here
show this pattern of more parsimonious solutions for dynamic homology than for
static alignments.
Both the static and dynamic homologies mentioned earlier rely on a notion
of homology which derives from nucleotide base correspondences. This need
not be the case, however. Homology could be viewed as a phenomenon existing
at the level of the sequences themselves as opposed to base-to-base statements.
This view sees entire strings of DNA nucleotides as characters. Such entities as
the small subunit rDNA locus could be a single character. The locus would then
be homologous among all the taxa, and the actual observed sequences themselves
would constitute the character states. In the context discussed earlier, this would
constitute a "static" homology scheme since the character vectors would be
preordained. The complexity of the sequences would allow for homology
statements more like those of other forms of character analysis, where position
and complexity aid in character delimitation.
Length = Minimum
FIGURE 6 An example of down-pass cladogram optimization via the fixed-state approach.
I    AAATTT
II   TTT
III  AAA
Two other aspects of this approach affect ideas of homology. One of the
salient features of base-to-base methods, whether built upon static or dynamic
homologies, is difficulty in tracing complex homologies through the cladogram
(or alignment); in other words, messy data. When there is extensive sequence
length variation coupled with base changes, tremendous uncertainty in homology
can occur in both multiple alignment and optimization-alignment. The
requirement that such variation be accommodated over the entire cladogram can
make local uncertainties propagate throughout the analysis. Since the
sequence-level homology approach transforms the complex states with their variations in
length and nucleotide base composition into simple numbered states with
pairwise costs, this problem does not occur. Such seemingly confusing variation
patterns will certainly lead to longer cladograms, but the homologies (at the
fragment level) will remain clear.
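The transformation of sequences into numbered states with pairwise costs can be sketched as follows. This is a minimal illustration, not the POY implementation: it assumes that the cost between two states is the minimum pairwise edit cost (indels 2, base changes 1), that internal nodes may take only observed sequences as states, and it reuses the four sequences of Fig. 1. All names are hypothetical.

```python
def edit_cost(a, b, indel=2, subst=1):
    """Minimum cost of turning sequence a into b (dynamic programming)."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * indel]
        for j in range(1, len(b) + 1):
            match = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else subst)
            cur.append(min(match, prev[j] + indel, cur[j - 1] + indel))
        prev = cur
    return prev[-1]


def fixed_state_down_pass(tree, observed, states):
    """Return {state: minimum subtree cost if this node takes that state}."""
    if isinstance(tree, str):  # terminal taxon keeps its observed sequence
        return {s: 0 if s == observed[tree] else float("inf") for s in states}
    left = fixed_state_down_pass(tree[0], observed, states)
    right = fixed_state_down_pass(tree[1], observed, states)
    return {
        s: min(left[t] + edit_cost(s, t) for t in states)
        + min(right[t] + edit_cost(s, t) for t in states)
        for s in states
    }


def fixed_state_length(tree, observed):
    """Cladogram length when each whole sequence is one character state."""
    states = sorted(set(observed.values()))
    return min(fixed_state_down_pass(tree, observed, states).values())


# Sequences of Fig. 1, unaligned: each is a single state of one character.
SEQUENCES = {"I": "GGGG", "II": "GGG", "III": "GAAG", "IV": "GAA"}

TOPOLOGIES = [
    ((("I", "II"), "III"), "IV"),
    ((("I", "III"), "II"), "IV"),
    ((("I", "IV"), "II"), "III"),
]

for tree in TOPOLOGIES:
    print(tree, fixed_state_length(tree, SEQUENCES))
```

With these inputs the sketch returns lengths of 6, 6, and 7 for the three topologies. These values are a by-product of the stated assumptions rather than results reported in this chapter, but they show the general point: no alignment is ever built, and the only homology statement is that the fragment itself corresponds across taxa.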
A second feature of fragment level homology is the requirement that the
character homologies be defined a priori. Whether entire loci, structurally or
functionally defined regions are employed as homologies, they are determined by
the investigator. This is akin to the delimitation of variation in complex
morphological features. Are complex structures such as complete development
in the endopterygote insects single or multiple characters? As with all such
seemingly arbitrary decisions, what matters most is the effect of changing these
character delimitations on phylogenetic results.
The notion of synapomorphy as a shared derived feature might also seem to
be altered by the homology concept implicit in sequence fragment comparisons.
Since each taxon may well express a unique character state, it might appear that
synapomorphy (as a shared state) would be impossible. This criticism would
only apply if the characters were completely unordered. State transformation
costs are not equal among states, hence are more akin to synapomorphy in the
context of ordered characters. Two taxa might present states 1 and 2 of an
ordered series 0 → 1 → 2. These taxa are united by the transformation implied
by the ordering with 1 and 2 sharing special derived similarity not found in 0
(Platnick, 1979). The concept of synapomorphy (or Homology) is unaffected by
the fixed-state approach.
COMPARISONS
cladogram length) cannot be used. The things that are counted are just not the
same. This notion of character congruence, however, can be extended to the
broader concept of congruence among data sets. Character congruence has been
used to discriminate among analysis parameters (Wheeler, 1995; Whiting et al.,
1997; Wheeler and Hayashi, 1998) and could reasonably be used to compare the
behavior of methods (although numerous other means could also be employed).
Two types of congruence measures can be used: character based and
topological. The relative merits and demerits of these approaches have been
explored in the literature (Mickevich and Farris, 1981; Wheeler, 1995) and
character congruence will be used here due to its link with parsimony and
combined data analysis. Phylogenetic methods are judged to be superior if they
accommodate variation in multiple data sets efficiently as measured by the
Mickevich-Farris incongruence length metric (Mickevich and Farris, 1981).
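The incongruence length metric can be written as a one-line calculation. The sketch below follows the form given in the table footnote, (Combined - sum of the individual partition lengths)/Combined. The function name and the numbers in the usage example are hypothetical and are not results from this study.

```python
def incongruence_length_difference(combined_length, partition_lengths):
    """Extra steps required by combining partitions, as a fraction of the combined length."""
    return (combined_length - sum(partition_lengths)) / combined_length


# Hypothetical lengths for four partitions analyzed separately and in combination.
example_partitions = [400, 100, 300, 150]
example_combined = 1000
print(incongruence_length_difference(example_combined, example_partitions))
```

A lower value means the partitions disagree less when analyzed together, so under this criterion the method that yields the lowest value accommodates the variation most efficiently.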
EXAMPLE--ARTHROPODS
Mollusca
    Cephalopoda        Loligo pealei
    Polyplacophora     Lepidochiton cavernae
Annelida
    Polychaeta         Glycera sp.
    Oligochaeta        Lumbricus terrestris
    Hirudinea          Haemopis marmorata
Onychophora
    Peripatoidae       Peripatus trinitatis
    Peripatopsidae     Peripatoides novaezealandiae
Trilobita              groundplan of Ramsköld and Edgecombe, 1991
                       (morphological analysis only)
Chelicerata
    Pycnogonida        Anoplodactylus portus
    Xiphosura          Limulus polyphemus
    Scorpiones         Centruroides hentzii
    Uropygi            Mastogoproctus giganteus
    Araneae            Nephila clavipes
    Araneae            Peucetia viridans
Crustacea
Three analyses were performed. In each case, the insertion-deletion cost was
set at two and all base substitutions set at one. When morphological characters
were used, character transformations were set at two. In the first analysis, the
data were aligned (via MALIGN; Wheeler and Gladstein, 1994) and
phylogenetic analysis was performed using PHAST (Goloboff, 1996). The
second analysis employed optimization-alignment as implemented in POY
(Gladstein and Wheeler, 1996). The third used the fixed-state optimization
technique also as implemented in POY. Gaps/indels were included and given the
same weight (2) in all length calculations. All searches employed TBR branch
swapping and 10 random addition sequences. The results of the individual data
partitions, combined results, and congruence calculations are summarized in
Table II and Figs. 8-10.
1 This length of 387 steps is shorter than that of the optimization-alignment purely due to the treatment
of ambiguities. When all ambiguities are treated as missing data, both alignment (MALIGN-PHAST)
and optimization-alignment (POY) yield the same length of 387 steps.
2 This length is 2 times the length of 126 steps.
3 Calculated as (Combined - 18S rDNA - Ubiquitin - 28S rDNA - Morphology)/Combined.
[Cladogram graphics for Figs. 8-10 (panels labeled A-F). Terminal taxa shown: Lepidochiton, Loligo,
Glycera, Haemopis, Lumbricus, Peripatoides, Peripatus, Anoplodactylus, Limulus, Centruroides,
Mastogoproctus, Peucetia, Nephila, Callinectes, Balanus, Scutigera, Spirobolus, Thermobius,
Heptagenia, Dorocordulia, Libellula, Mantis, Tibicen, Papilio, Drosophila.]
FIGURE 10 Cladograms of combined data (18S rDNA, Ubiquitin, 28S rDNA, morphology) for
arthropod taxa when subjected to different analytical techniques. A. Multiple sequence alignment. B.
Optimization-alignment. C. Fixed-state optimization.
DISCUSSION
ACKNOWLEDGMENTS
I would like to acknowledge the contributions of Daniel Janies, Gonzalo Giribet, Norman
Platnick, Lorenzo Prendini, Randall Schuh, Susanne Schulmeister, and Mark Williams to this work
through discussion and abuse. I would also like to thank Portia Rollins for expert art work.
LITERATURE CITED
Gladstein, D. S., and Wheeler W. C. (1997). "POY: The Optimization of Alignment Characters."
Program and Documentation. New York, NY. Available at "ftp.amnh.org"/pub/molecular.
Goloboff, P. (1996). PHAST. Program and Documentation. Version 1.5.
Mickevich, M. F., and Farris, J. S. (1981). The implications of congruence in Menidia. Syst. Zool.
30:351-370.
Platnick, N. I. (1979). Philosophy and the transformation of cladistics. Syst. Zool. 28:537-546.
Ramsköld, L., and Edgecombe, G. D. (1991). Trilobite monophyly revisited. Hist. Biol. 4:267-283.
Wheeler, W. C. (2000). Heuristic reconstruction of hypothetical-ancestral DNA sequences: sequence
alignment versus direct optimization. In "Homology and Systematics: Coding Characters for
Phylogenetic Analysis" (R. W. Scotland, ed.), pp. 106-113. Taylor and Francis, London.
Wheeler, W. C. (1999). Fixed character states and the optimization of molecular sequence data.
Cladistics 15:379-385.
Wheeler, W. C. (1996). Optimization alignment: the end of multiple sequence alignment in
phylogenetics? Cladistics 12:1-10.
Wheeler, W. C. (1995). Sequence alignment, parameter sensitivity, and the phylogenetic analysis of
molecular data. Syst. Biol. 44:321-332.
Wheeler, W. C., and Gladstein, D. S. (1994). MALIGN: A multiple sequence alignment program. J.
Hered. 85:417.
Wheeler, W. C., and Gladstein, D. S. (1992-1996). Malign: A Multiple Sequence Alignment
Program. Program and Documentation. New York, NY. Available at ftp.amnh.org
/pub/molecular/malign
Wheeler, W. C., and Hayashi, C. Y. (1998). The phylogeny of the chelicerate orders. Cladistics
14:173-192.
Wheeler, W. C., Cartwright, P., and Hayashi, C. (1993). Arthropod phylogenetics: a total evidence
approach. Cladistics 9:1-39.
Whiting, M. F., Carpenter, J. C., Wheeler, Q. D., and Wheeler, W. C. (1997). The Strepsiptera
problem: phylogeny of the holometabolous insect orders inferred from 18S and 28S ribosomal
DNA sequences and morphology. Syst. Biol. 46:1-68.