Chao Jaccard, Chao Sorensen
All content following this page was uploaded by Anne Chao on 10 September 2015.
LETTER
A new statistical approach for assessing similarity
of species composition with incidence and
abundance data
Anne Chao,1 Robin L. Chazdon,2* Robert K. Colwell2 and Tsung-Jen Shen1
1 Institute of Statistics, National Tsing Hua University, Hsin-Chu, Taiwan
2 Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, CT, USA
*Correspondence: E-mail: chazdon@uconn.edu

Abstract
The classic Jaccard and Sørensen indices of compositional similarity (and other indices
that depend upon the same variables) are notoriously sensitive to sample size, especially
for assemblages with numerous rare species. Further, because these indices are based
solely on presence–absence data, accurate estimators for them are unattainable. We
provide a probabilistic derivation for the classic, incidence-based forms of these indices
and extend this approach to formulate new Jaccard-type or Sørensen-type indices based
on species abundance data. We then propose estimators for these indices that include the
effect of unseen shared species, based on either (replicated) incidence- or abundance-based
sample data. In sampling simulations, these new estimators prove to be
considerably less biased than classic indices when a substantial proportion of species are
missing from samples. Based on species-rich empirical datasets, we show how
incorporating the effect of unseen shared species not only increases accuracy but also
can change the interpretation of results.
Keywords
Abundance data, beta diversity, biodiversity, complementarity, incidence data, shared
species, similarity estimators, similarity index, species overlap, succession.
oldest and most widely used similarity indices for assessing compositional similarity of
assemblages (sometimes called ‘species overlap’) and hence, its complement, dissimilarity.
Both measures are based on the presence/absence of species in paired assemblages and are
simple to compute (Magurran 2004). Many other similarity indices exist that are based on the
same information: the number of species shared by two samples and the number of species
unique to each of them (Legendre & Legendre 1998), and new indices continue to appear
(e.g. Lennon et al. 2001). A modified version of the Sørensen index was developed by Bray &
Curtis (1957), based on abundance data (also known as the Sørensen abundance index;
Magurran 2004), and a large number of other abundance-based indices have been developed
(Legendre & Legendre 1998), including the widely applied Morisita–Horn index (Magurran 2004).

Despite their wide application in ecological studies, the classic Jaccard and Sørensen
indices, when computed for sample data, perform poorly as measures of similarity between
diverse assemblages that include a substantial fraction of rare species (Wolda 1981;
Colwell & Coddington 1994; Plotkin & Muller-Landau 2002), because the sample data are
(usually wrongly) assumed to be true and complete representations of assemblage
composition. [Indeed, with very few exceptions (e.g. Grassle & Smith 1976; MacKenzie
et al. 2004), nearly all existing approaches to measuring similarity make this assumption.]
In general, as we will show with simulations, these measures are likely to severely
underestimate true similarity between two (genuinely similar) assemblages that contain
numerous rare species. Because many species are missed by the samples, the rare species
that appear in one sample are likely to be different than the rare species that show up in
the other sample, even if all are actually present in both assemblages. Similar problems
arise from comparing two samples of substantially different size: simply because it
contains fewer individuals or sampling units, the smaller sample may lack species that
appear in the larger sample. In short, the underestimation of similarity occurs because of
the failure to account for unseen shared species.

In principle, overestimation of similarity can also occur when comparing undersampled,
high-dominance communities in which the common species are widespread and rare ones tend
to be locally endemic. In this case, two samples might yield the same few common species,
but fail to reveal rare species that would differentiate the assemblages in larger samples
(Colwell & Coddington 1994; Ruokolainen & Tuomisto 2002 discuss a possible example). In
nearly all cases we have examined quantitatively, however, rarity (either in nature or
because of small sample size) increases the chance that a species will be spuriously absent
from one sample but not from the other, thus negatively biasing similarity indices.
[Fisher (1999, Fig. 8) comes to the same conclusion for several datasets, based on
rarefaction tests.] Moreover, for the new indices we present here, it can be shown
theoretically that sampling bias, when present, is always negative. [The authors
demonstrate the expected negative bias mathematically (A. Chao, R. L. Chazdon,
R. K. Colwell & T.-J. Shen, unpublished data); it can be proved for any abundance models
given in Magurran (2004) and Plotkin & Muller-Landau (2002).]

Recently, interest has intensified in the development and evaluation of indices to measure
beta diversity, or turnover rate, of species assemblages (Duivenvoorden 1995; Lennon
et al. 2001; Arita & Rodríguez 2002, 2004; Condit et al. 2002; Plotkin & Muller-Landau 2002;
Koleff et al. 2003; Rodríguez & Arita 2004), underscoring the need for robust statistical
estimators for inferring compositional similarity from sample data. Increasing species
turnover (decreasing similarity) with increasing distance between sites may reflect spatial
patterns of dispersal or may be driven by increasing environmental heterogeneity at greater
scales (Harte et al. 1999; Hubbell 2001; Balvanera et al. 2002; Chave & Leigh 2002; Condit
et al. 2002; Duivenvoorden et al. 2002; Ruokolainen & Tuomisto 2002; Rodríguez & Arita 2004;
Valencia et al. 2004). Unfortunately, most indices of beta diversity rely on the same
information as the classic Jaccard and Sørensen indices and share the limitations discussed
above.

With this problem in mind, Plotkin & Muller-Landau (2002) developed a Sørensen-type
similarity index for abundance counts using a ‘parametric’ approach that relies on a gamma
distribution to characterize species abundance structure. Condit et al. (2002) adopt an
approach to measuring beta diversity using Leigh et al.’s (1993) ‘codominance’ index F, the
probability that two individuals chosen randomly from each of two assemblages are the same
species. Although this measure is based on abundance data, F, itself, is not a
statistically valid index of similarity. For two identical assemblages with many species,
F tends to 0. Moreover, it is possible for any two identical assemblages to have any value
of F from 0 to 1, depending on how many species are present and patterns of relative
abundance. It is possible, however, to normalize F to produce a valid similarity index.
Chave & Leigh (2002) point out that the Morisita–Horn index is a normalized version of F.

We begin by developing a new, probabilistic approach for the classic Jaccard and Sørensen
incidence-based indices. We then extend this approach to formulate Jaccard-type and
Sørensen-type indices that consider species abundances. In contrast to Plotkin &
Muller-Landau (2002), we adopt a non-parametric approach that does not require any
assumptions about species abundance distributions. We then propose a method to estimate
both incidence-based and abundance-based Jaccard and Sørensen indices from sample data,
incorporating the effect of unseen shared species.
We then carry out sampling simulations with empirical data sets to assess the relative
performance of the classic Jaccard and Sørensen indices; their new, abundance-based Jaccard
and Sørensen counterparts; and the corresponding Jaccard and Sørensen estimators. We show
that incorporating the effect of unseen species substantially reduces the sample-size bias
of these estimators and improves their suitability for inferring similarity (or its
complement, dissimilarity) between hyper-diverse assemblages for which a large proportion
of species are missing from samples. Finally, we illustrate an application of the new
abundance-based Jaccard index and the Jaccard abundance-based estimator, using data from a
successional study of tree, sapling and seedling abundance of canopy species. Based on data
sets for rich, tropical insect and plant assemblages, we show how incorporating the effect
of unseen shared species not only increases accuracy, but also can change the
interpretation of results.

DEVELOPING THE NEW INDICES AND ESTIMATORS

The classic Sørensen and Jaccard similarity indices

The classic Sørensen and Jaccard indices depend on three simple incidence counts: the
number of species shared by two assemblages and the number of species unique to each of
them. It has become traditional to refer to these counts as A, B and C, respectively
(Table 1). The classic Jaccard and Sørensen indices for incidence counts are then

\[ J_{\mathrm{clas}} = \frac{A}{A + B + C} \tag{1} \]

and

\[ L_{\mathrm{clas}} = \frac{2A}{2A + B + C} \tag{2} \]

(We use L for the Sørensen index to avoid confusion with S for species.) There is a close,
monotonic relation between the two indices: Lclas = 2Jclas/(Jclas + 1) and
Jclas = 1/(2/Lclas − 1).

Assume that there are S1 species in Assemblage 1 and S2 species in Assemblage 2. Let the
number of shared species be S12. Then, the incidence counts A, B, C in Table 1 correspond
to A = S12, B = S1 − S12 and C = S2 − S12. Substituting these expressions in eqns 1 and 2,
we have an alternate way to write the classic indices that will be required for the next
steps in developing the new indices:

\[ J_{\mathrm{clas}} = \frac{A}{A + B + C} = \frac{S_{12}}{S_1 + S_2 - S_{12}} \tag{3} \]

and

\[ L_{\mathrm{clas}} = \frac{2A}{2A + B + C} = \frac{2S_{12}}{S_1 + S_2}. \tag{4} \]

Table 1 Species classification counts used in the classic indices

                            Assemblage 2
                            Present    Absent
Assemblage 1    Present     A          B
                Absent      C          –

A probabilistic approach to the classic Jaccard and Sørensen indices

The classic Jaccard and Sørensen indices consider only the presence or absence (incidence)
of species. Two pairs of assemblages, one pair sharing abundant species but not rare ones
and the other pair sharing rare species, but not common ones, will yield the same index
value. From the point of view of overall assemblage similarity, taking similarity of
assemblage composition to the level of individuals often makes more sense (Magurran 2004).
Our next objective is to extend the incidence indices to take account of the relative
abundance of species, a prerequisite for developing index estimators for sampling data that
take account of unseen rare species.

We must first provide a probabilistic derivation of the classic Jaccard and Sørensen
incidence indices. Suppose we randomly select a species from Assemblage 1 and a species
from Assemblage 2 and then classify each member of the pair according to whether it is a
shared species or not. The corresponding probabilities are shown graphically in Fig. 1 and
specified in Table 2.

Although the probabilities in Table 2 are not counts, they can be thought of as ‘normalized
counts,’ because they sum to unity. Substituting these probabilities into eqns 1 and 2,
then we have

\[ J_{\mathrm{clas}} = \frac{A}{A + B + C}
 = \frac{(S_{12}/S_1)(S_{12}/S_2)}{(S_{12}/S_1)(S_{12}/S_2) + (S_{12}/S_1)\bigl(1 - S_{12}/S_2\bigr) + \bigl(1 - S_{12}/S_1\bigr)(S_{12}/S_2)}
 = \frac{S_{12}}{S_1 + S_2 - S_{12}} \]

which is exactly eqn 3. Likewise, we have

\[ L_{\mathrm{clas}} = \frac{2A}{2A + B + C}
 = \frac{2(S_{12}/S_1)(S_{12}/S_2)}{2(S_{12}/S_1)(S_{12}/S_2) + (S_{12}/S_1)\bigl(1 - S_{12}/S_2\bigr) + \bigl(1 - S_{12}/S_1\bigr)(S_{12}/S_2)}
 = \frac{2S_{12}}{S_1 + S_2} \]

which is the same as eqn 4.
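The equivalence between the count form (eqns 3 and 4) and the probabilistic form (Table 2) can be checked numerically. The following is a minimal sketch, not the authors' code: the function names and the example values S1 = 50, S2 = 40, S12 = 20 are hypothetical, and exact rational arithmetic is used so the two forms can be compared without rounding.

```python
from fractions import Fraction


def classic_indices(s1, s2, s12):
    """Classic Jaccard and Sorensen indices from species counts (eqns 3 and 4).

    s1, s2: species richness of Assemblage 1 and 2; s12: shared species (> 0).
    """
    a, b, c = s12, s1 - s12, s2 - s12  # shared, unique to 1, unique to 2
    return Fraction(a, a + b + c), Fraction(2 * a, 2 * a + b + c)


def probabilistic_indices(s1, s2, s12):
    """The same indices computed from the Table 2 cell probabilities."""
    p1 = Fraction(s12, s1)  # chance a random species of Assemblage 1 is shared
    p2 = Fraction(s12, s2)  # chance a random species of Assemblage 2 is shared
    a = p1 * p2             # Case 1: both shared
    b = p1 * (1 - p2)       # Case 2: shared in 1, non-shared in 2
    c = (1 - p1) * p2       # Case 3: non-shared in 1, shared in 2
    return a / (a + b + c), 2 * a / (2 * a + b + c)


# Hypothetical example values, chosen only for illustration.
j_clas, l_clas = classic_indices(50, 40, 20)
print(j_clas, l_clas)                        # 2/7 4/9
print(probabilistic_indices(50, 40, 20))     # same two fractions
print(l_clas == 2 * j_clas / (j_clas + 1))   # monotonic relation between L and J
```

Both routes give the same values because the common factor 1/(S1 S2) cancels between numerator and denominator.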
It might appear that we have made no progress, but this probabilistic approach lays the
groundwork for developing abundance-based indices, which in turn allow for the estimation
of indices that take into account the effect of unseen shared species. Note that, using
this approach, we can also calculate the chance that both randomly chosen species are
non-shared species (Case 4 as shown in Fig. 1 and Table 2). However, the basic concept for
the Jaccard and Sørensen indices is based only on information for the other three cells
(Cases 1–3).

Table 2 Probabilistic derivation of species counts for the classic indices

                                        Select any species from Assemblage 2
Select any species from Assemblage 1    Shared                        Non-shared
Shared                                  A = (S12/S1)(S12/S2)          B = (S12/S1)(1 − S12/S2)
                                        (Case 1)                      (Case 2)
Non-shared                              C = (1 − S12/S1)(S12/S2)      (1 − S12/S1)(1 − S12/S2)
                                        (Case 3)                      (Case 4)

common and some are rare. Instead, the basic idea for handling abundance counts is that we
treat all individuals equally. Adapting the approach from the previous section, we randomly
select one individual from Assemblage 1 and one individual from Assemblage 2. For each
individual of the pair, note whether it belongs to a shared species or not.

We now derive the general formulas for the abundance-based versions of the Jaccard and
Sørensen indices. Without loss of generality, we assume the first S12 species are shared
species, that is, the shared species are indexed by 1, 2, …, S12. In Assemblage 1, let U
denote the total relative abundances of individuals belonging to the shared species,
U = p1 + p2 + … + pS12. Likewise in Assemblage 2, let V denote the total relative
abundances of individuals belonging to shared species, V = π1 + π2 + … + πS12. Table 3
shows the probabilities that two individuals, one from each assemblage, represent each of
the usual four categories.

Based on eqns 1 and 2 for the three probabilities (A, B and C in Table 3), we obtain the
following abundance-based indices in terms of U and V:

\[ J_{\mathrm{abd}} = \frac{A}{A + B + C} = \frac{UV}{U + V - UV} \tag{5} \]
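As an illustration of eqn 5, the sketch below computes U, V and the abundance-based indices from two hypothetical abundance tables. It rests on two assumptions that are not spelled out in this excerpt: that the three cell probabilities are A = UV, B = U(1 − V) and C = (1 − U)V (by analogy with Table 2, and consistent with eqn 5), and that the Sørensen-type index is therefore 2UV/(U + V), obtained by substituting the same probabilities into eqn 2. The function name and the example data are invented for illustration.

```python
from fractions import Fraction


def abundance_indices(counts1, counts2):
    """Abundance-based Jaccard- and Sorensen-type indices in terms of U and V.

    counts1, counts2: dicts mapping species label -> number of individuals.
    Returns (U, V, J_abd, L_abd). No adjustment for unseen shared species.
    """
    present1 = {sp for sp, k in counts1.items() if k > 0}
    present2 = {sp for sp, k in counts2.items() if k > 0}
    shared = present1 & present2
    n1, n2 = sum(counts1.values()), sum(counts2.values())
    # U, V: total relative abundance of shared species in each assemblage.
    u = Fraction(sum(counts1[sp] for sp in shared), n1)
    v = Fraction(sum(counts2[sp] for sp in shared), n2)
    if not shared:
        return u, v, Fraction(0), Fraction(0)  # no overlap at all
    j_abd = u * v / (u + v - u * v)            # eqn 5
    l_abd = 2 * u * v / (u + v)                # assumed form, from eqn 2
    return u, v, j_abd, l_abd


# Hypothetical assemblages: species a and b are shared, but rare in the second.
assemblage1 = {"a": 50, "b": 30, "c": 20}
assemblage2 = {"a": 10, "b": 10, "d": 80}
u, v, j_abd, l_abd = abundance_indices(assemblage1, assemblage2)
print(u, v, j_abd, l_abd)  # 4/5 1/5 4/21 8/25
```

In this made-up example the classic Jaccard index is 2/4 = 0.5, whereas the abundance-based value is 4/21 (about 0.19), because the shared species account for only one-fifth of the individuals in the second assemblage.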
\[ \hat{U}_{\mathrm{inc}} = \sum_{i=1}^{D_{12}} \frac{X_i}{n}
 + \frac{(z - 1)}{z}\,\frac{f_{+1}}{2f_{+2}} \sum_{i=1}^{D_{12}} \frac{X_i}{n}\, I(Y_i = 1) \tag{11} \]

and

\[ \hat{V}_{\mathrm{inc}} = \sum_{i=1}^{D_{12}} \frac{Y_i}{m}
 + \frac{(w - 1)}{w}\,\frac{f_{1+}}{2f_{2+}} \sum_{i=1}^{D_{12}} \frac{Y_i}{m}\, I(X_i = 1) \tag{12} \]

(The same modifications described for eqns 7 and 8 may be applied here if f+2 = 0 or
f2+ = 0.) Thus, our proposed incidence-based Jaccard and Sørensen estimators are

\[ \hat{J}_{\mathrm{inc}} = \frac{\hat{U}_{\mathrm{inc}} \hat{V}_{\mathrm{inc}}}{\hat{U}_{\mathrm{inc}} + \hat{V}_{\mathrm{inc}} - \hat{U}_{\mathrm{inc}} \hat{V}_{\mathrm{inc}}} \tag{13} \]

The tests

Although the classic Jaccard and Sørensen indices and our new indices all measure
‘similarity,’ they are intended to measure different aspects of this construct: the classic
indices ostensibly measure similarity in species composition while ignoring relative
abundance (although they are strongly affected by it, when sampling is involved), whereas
our new indices [and many others (Legendre & Legendre 1998; Magurran 2004)] explicitly
consider relative abundance. Thus, for any particular data set, differences in the absolute
magnitude of incidence- vs. abundance-based Jaccard or Sørensen values (or indeed,
differences between most other indices of similarity) are meaningless, in themselves.
Nevertheless, indices of compositional similarity can be compared in terms of their
performance in tests of sensitivity to undersampling. Using the ant data, we illustrate
three tests: (1) Test 1: equal-sized samples from a single data set (within-assemblage
rarefaction); (2) Test 2: unequal-sized samples from a single data set; and (3) Test 3:
equal-proportion samples from two data sets (between-assemblage rarefaction). For purposes
of these tests, we treated the ant data from each collecting method (Berlese, Malaise, or
Fogging) as a separate, complete ‘assemblage,’ referred to here as a sampling pool. Samples
of specified sizes (in terms of numbers of individuals) were then selected, at random, with
replacement, from these pools. Of course, not all species present in a sampling pool are
represented in smaller samples. However, because sampling was done with replacement, not
all species are present even when the entire number of individuals selected is the same as
the number of individuals in the pool.

RESULTS

Test 1: Equal-sized samples from a single data set

All similarity indices yield a true value of 1 when a complete sampling pool (assemblage)
is compared with itself. What happens when a similarity index is computed for two random
samples of a single sampling pool? If an index is unbiased by sample size, it should yield
a value of 1 when applied to samples of any size. First, we randomly sampled individuals
(with replacement) from the pooled ant data for a single collecting method to produce pairs
of samples having the same number of individuals as the pools themselves (full samples).
Next, we randomly selected smaller samples, each totalling one-half the number of
individuals in the original sampling pool, then computed similarity indices for this sample
pair. We then repeated this procedure for a pair of samples each 1/4 the size of the
original pool, then a pair 1/8 the size of the pool, and so on, successively halving sample
size, down to 1/64 the original number of individuals. (Note that this is quite a severe
test of undersampling bias, even for these very large pools.) This process was repeated
1000 times and means taken, for each test of each index, and for each of the three ant
collecting methods.

Figure 2 shows representative results of this test for the classic Jaccard and Sørensen
indices (first column of panels, Test 1: Berlese rarefaction). Clearly both of these
indices were quite sensitive to undersampling. Figure 3 (first column of panels) shows the
corresponding results for the new indices for this test. The new abundance-based Jaccard
and Sørensen indices, without adjustment for unseen shared species (Jabd and Labd), were
also sensitive to sample size. In
[Figure 2 plot residue: panel label Sørensen (Lclas); y-axis 0 to 1.0; x-axis sample fraction Full, 1/2, 1/4, 1/8, 1/16, 1/32, 1/64.]
Figure 2 Random sampling tests of the classic Jaccard (Jclas, eqn 1) and Sørensen (Lclas, eqn 2) overlap indices. The graphs show the effect
on each index of considering random samples composed of 1/1 (Full), 1/2, 1/4, …, 1/64 of the abundances or incidence-equivalents in the
sampling pools, sampled with replacement. (The labels on the lower left graph are the same for all graphs.) Column 1 (Test 1: Berlese
rarefaction) shows similarity index values for equal-sized, paired samples from the Berlese ant data set. Column 2 (Test 2: Berlese unequal)
shows index values for comparisons of samples of decreasing size vs. a sample of the same size as the full Berlese ant data set. Column 3
(Malaise–Fog rarefaction) shows similarity index values for equal-proportion, paired samples (Test 3) from the Malaise vs. the Fogging ant
data set, a high-similarity comparison. Column 4 (Malaise–Berlese rarefaction) shows similarity index values for equal-proportion, paired
samples (Test 3) from the Berlese vs. the Malaise ant data set, a low-similarity comparison. The true values of each index for the sampling
pools considered are shown by horizontal dotted lines in the columns for Test 3 (Malaise–Fog and Malaise–Berlese rarefaction). The true
index value for Test 1 and Test 2 is 1.0, the top of the graphs.
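The halving scheme of Test 1 can be sketched in a few lines. This is an illustrative simulation under stated assumptions, not a reproduction of the published analysis: the pool below is a made-up assemblage of 100 species (a few common, many rare) rather than the ant data, only the classic Jaccard index is computed, and 200 replicates are used instead of 1000 to keep the run short. The names `jaccard_classic` and `halving_test` are hypothetical.

```python
import random


def jaccard_classic(sample1, sample2):
    """Classic Jaccard index (eqn 1) from two lists of species labels."""
    s1, s2 = set(sample1), set(sample2)
    return len(s1 & s2) / len(s1 | s2)


def halving_test(pool, divisors=(1, 2, 4, 8, 16, 32, 64), reps=200, seed=0):
    """Mean index for paired, equal-sized samples drawn with replacement.

    pool: list with one species label per individual.
    divisors: sample size is len(pool) // d for each d (1 = full sample).
    """
    rng = random.Random(seed)
    means = {}
    for d in divisors:
        size = max(1, len(pool) // d)
        total = 0.0
        for _ in range(reps):
            a = rng.choices(pool, k=size)  # sampling with replacement
            b = rng.choices(pool, k=size)
            total += jaccard_classic(a, b)
        means[d] = total / reps
    return means


# Hypothetical pool: species 0 has 200 individuals, species 99 only 2.
pool = [sp for sp in range(100) for _ in range(max(1, 200 // (sp + 1)))]
means = halving_test(pool)
for d, value in means.items():
    print(f"1/{d} of pool: mean Jclas = {value:.3f}")
```

Because both samples in each pair come from the same pool, the true similarity is 1, so any shortfall in the printed means reflects undersampling alone.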
[Figure 3 plot residue: Jaccard panels Jabd, Ĵabd and Ĵinc; Sørensen panels Labd, L̂abd and L̂inc; y-axis 0 to 1.0; x-axis sample fraction Full, 1/2, 1/4, 1/8, 1/16, 1/32, 1/64.]
Figure 3 Random sampling tests of the new overlap indices. The graphs show the effect on each index of considering random samples
composed of 1/1 (Full), 1/2, 1/4, …, 1/64 of the abundances or incidence-equivalents in the sampling pools, sampled with replacement.
(The labels on the lower left graph are the same for all graphs.) Columns are described in the caption for Fig. 2. Jaccard indices: Jabd is the new
abundance-based Jaccard index, not adjusted for unseen species, computed by eqn 5. Ĵabd is the corresponding abundance-based estimator
that takes unseen species into account, computed by eqn 9. The estimator based on replicated incidence data, Ĵinc, is computed by eqn 13.
Sørensen indices: Labd is the new abundance-based Sørensen index, not adjusted for unseen species, computed by eqn 6. L̂abd is the
corresponding abundance-based estimator that takes unseen species into account, computed by eqn 10. The estimator based on replicated
incidence data, L̂inc, is computed by eqn 14. The true values of each index for the sampling pools considered are shown by horizontal dotted
lines in the columns for Test 3 (Malaise–Fog and Malaise–Berlese rarefaction). The true index value for Test 1 and Test 2 is 1.0, the top of the
graphs. To allow a valid comparison of the incidence-based estimators (Ĵinc and L̂inc) with the corresponding abundance-based estimators
(Ĵabd and L̂abd, respectively), the X-axis for each incidence-based estimator was re-scaled so that the minimum number of incidences matches
the minimum abundance of the corresponding abundance-based estimator, thus equalizing the amount of statistical information.
contrast, the Jaccard and Sørensen estimators, which include the estimated effect of unseen
shared species, proved to be less sensitive to undersampling, remaining substantially
closer to 1 even for small samples (Fig. 3). This was true for both the abundance-based
estimators (Ĵabd and L̂abd) and the estimators based on replicated incidence data (Ĵinc
and L̂inc).

Test 2: Unequal-sized samples from a single data set

A similarity index should ideally be robust to sample size not only for equal-sized
samples, but also for samples of unequal size. To test for this property we computed
similarity indices for samples of successively smaller size, vs. ‘full’ samples, equal in
number of individuals to the number in the corresponding sampling pool. As with the first
test, an ideal index should remain at 1, regardless of the discrepancy in sample sizes.
Figures 2 and 3 (second column, Test 2: Berlese unequal) show such a test for the Berlese
sample ant data, using samples created by the same scheme outlined for the first method.
Even more than in the first test, the classic Jaccard and Sørensen indices (Fig. 2) were
strongly affected by the size of the sample, leading to a severe negative bias when one
sample was markedly smaller than the full sample. In contrast, the new Jaccard and Sørensen
estimators (Fig. 3, second column) were strikingly resistant to undersampling, including
both abundance-based estimators (Ĵabd and L̂abd) and the estimators based on replicated
incidence data (Ĵinc and L̂inc).

Equal-proportion samples from two data sets

It is all very well for a similarity index to be robust to sample size in comparing paired
samples from the same pool, but an index is of little use if it does not retain that
robustness in comparing different data sets, while successfully detecting compositional
differences between them. We performed the same sample size comparison procedures described
for the first set of tests, but instead of comparing sample pairs from the same sampling
pool, we compared successively smaller sample pairs from the Malaise and Fogging [high
similarity (Longino et al. 2002)], and from the Malaise and Berlese (low similarity) data
sets. The results for the classic Jaccard and Sørensen indices appear in the third and
fourth columns of Fig. 2. An ideal index would yield and maintain the true value computed
for the full pools (the dotted horizontal line in each panel) in the face of rarefaction.
The classic Jaccard and Sørensen indices proved quite sensitive to undersampling in this
test (Fig. 2). The new abundance-based Jaccard and Sørensen indices, uncorrected for unseen
species (Jabd and Labd in third and fourth columns of Fig. 3), also suffer from
undersampling bias, but the bias is quite substantially reduced for their abundance-based
counterparts corrected for unseen species (Ĵabd and L̂abd in third and fourth columns of
Fig. 3) as well as for the corresponding estimators based on replicated incidence data
(Ĵinc and L̂inc in third and fourth columns of Fig. 3).

APPLICATION

As an example of the application of the new indices, we apply the classic Jaccard index
(eqn 1), the new abundance-based Jaccard index (eqn 5) and its estimator (eqn 9) to data
from two mature and four second-growth rainforest sites in Costa Rica. We examine
compositional similarity between species of trees ≥ 25 cm diameter at breast height (DBH;
canopy individuals), canopy tree saplings (1–5 cm DBH) and canopy tree seedlings (> 20 cm
height, but < 1 cm DBH) within four second-growth forests of different age since pasture
abandonment and in two old-growth forests in the same study area. During early stages of
succession, when the forest canopy is first beginning to close, fast-growing,
shade-intolerant colonizing tree species are present as canopy trees and are also found as
smaller individuals in the understory, as seedlings and saplings. As time progresses and
the understory becomes more shaded, these shade-intolerant tree species are eliminated from
the seedling and sapling pool and shade-tolerant species readily colonize these small size
classes. These shade-tolerant species are represented by seedlings and saplings, but have
few or no canopy trees present, gradually augmenting tree species richness as the forest
matures (Guariguata et al. 1997; Table 4). Thus, we would predict that, as secondary
forests mature, compositional similarity between tree species

Table 4 Observed patterns of species richness of tree seedlings, saplings and canopy
individuals in 1 ha plots in four second-growth and two old-growth forests in year 2000

Site               Age (year)   Sobs seedlings   Sobs saplings   Sobs canopy trees
LSUR               15           45               68              12
TIR                18           49               74              16
LEP                23           47               67              24
CR                 28           57               91              33
LSUR old-growth    > 200        47               101             37
LEP old-growth     > 200        69               102             43

All trees and saplings were marked and measured for diameter within a 1 ha plot in each
forest. Seedlings were sampled in 144 1 × 5 m quadrats within the 1 ha plot, for a total
area sampled of 0.072 ha. In these analyses, we included only canopy tree species; shrubs,
treelets and midstory trees were excluded. Note that young sites show a low number of
canopy tree species per ha (individuals ≥ 25 cm DBH) and fewer sapling species compared
with old-growth forests, but differences in seedling species richness were less pronounced.
and seedlings or saplings would initially be high, but would quickly decline to a minimum
during intermediate stages of succession and then begin to increase later in succession as
shade-tolerant trees reach reproductive maturity and produce seedlings that can establish,
grow and survive.

The classic Jaccard index (eqn 1) showed low compositional similarity between trees and
seedlings for the four second-growth forests compared with the old-growth forests, with
similarity decreasing slightly with age among the four second-growth forests (Fig. 4).
Similarity between trees and saplings, in contrast, showed gradual increases from the
youngest forest to the older second-growth forest, continuing the trend to old-growth
forests (Fig. 4).

The abundance-based Jaccard index (eqn 5) showed a strikingly different pattern across the
six forest stands. Compositional similarity between seedling and tree assemblages and
between sapling and tree assemblages was initially high in the youngest stand, as we had
predicted. As the forest matures, tree seedling and sapling pools become enriched by
shade-tolerant species not represented as canopy trees, resulting in a decreasing
compositional similarity that reached a minimum in the 23-year-old LEP stand (Fig. 4). This
minimum similarity represents a point in forest succession of maximum recruitment
limitation for both seedlings and saplings. In the oldest second-growth plot, CR, the
abundance-based Jaccard index began to increase, reflecting recruitment of shade-tolerant
species in all three size classes (Fig. 4). The similarity index continued to increase and
stabilized at 0.4–0.5 in the two old-growth stands. With the exception of one old-growth
stand, similarity indices were higher for seedlings vs. trees than for saplings vs. trees.
At the scale […] during forest succession.

The abundance-based Jaccard estimator (eqn 9), which incorporates the effects of unseen
shared species, showed similar general trends across stands when compared with the
abundance-based Jaccard index (Fig. 4). The 28-year-old second-growth stand, however, had
nearly comparable estimates of similarity compared with the two old-growth stands,
suggesting that the estimator is […] between the size classes (Fig. 4). The estimator for
sapling vs. tree similarity was higher than for seedling vs. trees in the TIR second-growth
site, indicating that this stand has more rare species of shared saplings than seedlings.

[Figure 4 plot residue: series Seedlings vs. trees and Saplings vs. trees; panels Jclas and Jabd; x-axis Forest site (LSUR, TIR, LEP, CR, LSUR, LEP) and Forest age (year): 15, 18, 23, 28, Old growth.]

Figure 4 Compositional similarity between canopy trees and seedlings and canopy trees and
saplings in four second-growth forests of increasing age and in two old-growth forests.
Results are shown for Jclas, the classic Jaccard index (eqn 1; top panel), for the new
abundance-based Jaccard index, Jabd (eqn 5) not adjusted for unseen species (middle panel),
and for Ĵabd, the new abundance-based Jaccard estimator that takes unseen species into
account (eqn 9; error bars are 1 SE, computed by a bootstrapping procedure; details
available from the first author; A. Chao, R. L. Chazdon, R. K. Colwell & T.-J. Shen,
unpublished data). These analyses include only canopy tree species; shrubs, treelets and
midstory tree species were excluded.

CONCLUSIONS

Because similarity is a qualitative human construct, it has no precise mathematical
definition. Nevertheless, measuring ‘similarity’ relies on quantitative indices devised for
the purpose, and in practice, we may expect that similarity indices fulfil reasonable
criteria for their mathematical behaviour (Legendre & Legendre 1998). Given indices that
make sense mathematically, it is their statistical performance under the realities of field
sampling that we have concerned ourselves with here, particularly for species-rich taxa for
which complete inventories are impractical or even impossible.
Using sampling simulations applied to representative field data sets, we confirmed that
two of the most widely used classic indices, Jaccard and Sørensen, are negatively biased
under conditions of undersampling, often quite substantially (Fig. 2). Our objective was
to develop new, probability-based indices that reduce undersampling bias by estimating
and compensating for the effects of unseen, shared species. We based a new similarity
index on the probability that two randomly chosen individuals, one from each of two
samples, both belong to any of the species shared by the two samples [not necessarily to
the same shared species, the basis of F (Chave & Leigh 2002; Condit et al. 2002) and the
Morisita–Horn index]. This approach opened the way to the crucial step, adjusting this
probability to account for the chance that larger samples would reveal a larger
proportion of shared species. As anticipated, the new indices consistently reduced
undersampling bias in the performance tests, in most circumstances quite substantially.
Inevitably some bias remains, especially under severe undersampling and for highly
dissimilar samples. Under such conditions, relatively little information exists to guide
bias reduction.

Ecologists distinguish two aspects of the compositional similarity of species
assemblages: similarity of species lists (incidence) and similarity of species' relative
abundances. Classic abundance-based indices (e.g. Morisita–Horn or Bray–Curtis) match
abundances, species-by-species. Our new indices take an intermediate path, by assessing
the probability that individuals belong to shared vs. unshared species, without regard to
which species they belong to. Unfortunately for many studies, unreplicated, pure
incidence data (pairs of species lists) provide no information that can be used to
estimate the number of unseen, shared species. In principle, it may be possible to derive
estimators that use abundance data to correct pure incidence similarity indices for
unseen species, but it is currently statistically difficult for biologically realistic
data. However, we recommend the new indices for any application in which not only species
matching but similarity of relative abundance is of interest. Moreover, these new indices
are better suited than the corresponding classic indices for assessing compositional
similarity between samples that differ in size, are known or suspected to be
undersampled, or are likely to contain numerous rare species.

ACKNOWLEDGEMENTS

We thank three anonymous referees for their comments and suggestions. This work was
supported by Taiwan National Science Council Contract NSC92-2118-M007-013 to A. Chao and
T.-J. Shen, by a grant from the Andrew W. Mellon Foundation to R. L. Chazdon, and by
US-NSF grant DEB-0072702 to R. K. Colwell. We thank Jorge Leiva for sharing vegetation
data for tree species in mature forests. The new estimators presented in this paper are
included in version 7.5 of ESTIMATES (Colwell 2004) and the program SPADE (Chao & Shen
2003), to be released upon publication of this paper. The complete derivation of eqns 7
and 8 and the variance estimators for eqns 9 and 10 are available upon request from the
first author. The complete ant data sets are available from RKC.

REFERENCES

Arita, H.T. & Rodríguez, P. (2002). Geographic range, turnover rate and the scaling of species diversity. Ecography, 25, 541–550.
Arita, H.T. & Rodríguez, P. (2004). Local–regional relationships and the geographical distribution of species. Global Ecol. Biogeogr., 13, 15–21.
Balvanera, P., Lott, E., Segura, G., Siebe, C. & Islas, A. (2002). Beta diversity patterns and correlates in a tropical dry forest of Mexico. J. Veg. Sci., 13, 145–158.
Bray, J.R. & Curtis, J.T. (1957). An ordination of the upland forest communities of southern Wisconsin. Ecol. Monogr., 27, 325–349.
Bunge, J. & Fitzpatrick, M. (1993). Estimating the number of species: a review. J. Am. Stat. Assoc., 88, 364–373.
Chao, A. (in press). Species richness estimation. In: Encyclopedia of Statistical Sciences, 2nd edn (eds Balakrishnan, N., Read, C.B. & Vidakovic, B.). Wiley Press, New York, NY, USA.
Chao, A. & Shen, T.J. (2003). Program SPADE (Species Prediction and Diversity Estimation). Program and User’s Guide available at http://chao.stat.nthu.edu.tw.
Chao, A., Ma, M.-C. & Yang, M.C.K. (1993). Stopping rules and estimation for recapture debugging with unequal failure rates. Biometrika, 80, 193–201.
Chave, J. & Leigh, E.G. (2002). A spatially explicit neutral model of beta-diversity in tropical forests. Theor. Pop. Biol., 62, 153–168.
Chazdon, R.L., Colwell, R.K., Denslow, J.S. & Guariguata, M.R. (1998). Statistical methods for estimating species richness of woody regeneration in primary and secondary rain forests of NE Costa Rica. In: Forest Biodiversity Research, Monitoring and Modeling: Conceptual Background and Old World Case Studies. (eds Dallmeier, F. & Comiskey, J.). Parthenon Publishing, Paris, France, pp. 285–309.
Colwell, R.K. (2004). ESTIMATES: Statistical Estimation of Species Richness and Shared Species from Samples, Version 7.5. Available at http://viceroy.eeb.uconn.edu/estimates. Persistent URL http://purl.oclc.org/estimates.
Colwell, R.K. & Coddington, J.A. (1994). Estimating terrestrial biodiversity through extrapolation. Phil. Trans. R. Soc. Lond. B Biol. Sci., 345, 101–118.
Colwell, R.K., Mao, C.X. & Chang, J. (2004). Interpolating, extrapolating, and comparing incidence-based species accumulation curves. Ecology, 85, 2717–2727.
Condit, R., Pitman, N., Leigh, E.G., Jr, Chave, J., Terborgh, J., Foster, R.B. et al. (2002). Beta-diversity in tropical forest trees. Science, 295, 666–669.
Duivenvoorden, J.F. (1995). Tree species composition and rain forest-environment relationships in the middle Caquetá area, Colombia, NW Amazonia. Vegetatio, 120, 91–113.
Duivenvoorden, J.F., Svenning, J.-C. & Wright, S.J. (2002). Beta diversity in tropical forests. Science, 295, 636–637.
Fisher, B.L. (1999). Improving inventory efficiency: a case study of leaf-litter ant diversity in Madagascar. Ecol. Appl., 9, 714–731.
Grassle, J.F. & Smith, W. (1976). A similarity measure sensitive to the contribution of rare species and its use in investigation of variation in marine benthic communities. Oecologia, 25, 13–22.
Guariguata, M.R., Chazdon, R.L., Denslow, J.S., Dupuy, J.M. & Anderson, L. (1997). Structure and floristics of secondary and old-growth forest stands in lowland Costa Rica. Plant Ecology, 132, 107–120.
Harte, J., Kinzig, A. & Green, J. (1999). Self-similarity in the distribution and abundance of species. Science, 284, 334–336.
Hubbell, S.P. (2001). A Unified Neutral Theory of Biodiversity and Biogeography. Princeton University Press, Princeton, NJ.
Koleff, P., Gaston, K.J. & Lennon, J.J. (2003). Measuring beta diversity for presence–absence data. J. Anim. Ecol., 72, 367–382.
Lee, S.-M. & Chao, A. (1994). Estimating population size via sample coverage for closed capture–recapture models. Biometrics, 50, 88–97.
Legendre, P. & Legendre, L. (1998). Numerical Ecology. Elsevier, Amsterdam.
Leigh, E.G., Wright, S.J., Putz, F.E. & Herre, E.A. (1993). The decline of tree diversity on newly isolated tropical islands: a test of a null hypothesis and some implications. Evol. Ecol., 7, 76–102.
Lennon, J.J., Koleff, P., Greenwood, J.J.D. & Gaston, K.J. (2001). The geographical structure of British bird distributions: diversity, spatial turnover and scale. J. Anim. Ecol., 70, 966–979.
Longino, J.T., Coddington, J. & Colwell, R.K. (2002). The ant fauna of a tropical rain forest: estimating species richness three different ways. Ecology, 83, 689–702.
MacKenzie, D.I., Bailey, L.L. & Nichols, J.D. (2004). Investigating species co-occurrence patterns when species are detected imperfectly. J. Anim. Ecol., 73, 546–555.
Magurran, A.E. (2004). Measuring Biological Diversity. Blackwell, Oxford.
Plotkin, J.B. & Muller-Landau, H.C. (2002). Sampling the species composition of a landscape. Ecology, 83, 3344–3356.
Rodríguez, P. & Arita, H.T. (2004). Beta diversity and latitude in North American mammals: testing the hypothesis of covariation. Ecography, 27, 1–11.
Ruokolainen, K. & Tuomisto, H. (2002). Beta-diversity in tropical forests. Science, 297, 1439a.
Valencia, R., Foster, R.B., Villa, G., Condit, R., Svenning, J.-C., Hernández, C. et al. (2004). Tree species distributions and local habitat variation in the Amazon: large forest plot in eastern Ecuador. J. Ecol., 92, 214–229.
Wolda, H. (1981). Similarity indices, sample size and diversity. Oecologia, 50, 296–302.

Editor, Nicholas Gotelli
Manuscript received 30 June 2004
First decision made 6 August 2004
Manuscript accepted 20 October 2004
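
The contrast drawn in the Discussion between the classic incidence-based indices and the shared-abundance probability can be illustrated numerically. The following is a minimal sketch, not the code used in ESTIMATES or SPADE. It assumes that U and V denote the fractions of individuals in each sample that belong to shared species, so that UV is the probability that two randomly chosen individuals, one from each sample, both belong to shared species, and that the unadjusted abundance-based Jaccard-type and Sørensen-type forms are UV/(U + V - UV) and 2UV/(U + V). The function names and the toy counts are hypothetical, and the correction for unseen shared species is deliberately not implemented.

```python
from typing import Dict, Tuple

Counts = Dict[str, int]  # species name -> number of individuals observed


def classic_indices(x: Counts, y: Counts) -> Tuple[float, float]:
    """Classic incidence-based (Jaccard, Sorensen) from observed species lists."""
    seen_x = {sp for sp, n in x.items() if n > 0}
    seen_y = {sp for sp, n in y.items() if n > 0}
    shared = len(seen_x & seen_y)
    total = len(seen_x) + len(seen_y)
    if total == 0:
        return 0.0, 0.0
    return shared / (total - shared), 2 * shared / total


def shared_abundance(x: Counts, y: Counts) -> Tuple[float, float]:
    """Observed U and V: share of individuals in each sample that belong to
    species found in both samples. Assumes both samples are non-empty."""
    shared = {sp for sp, n in x.items() if n > 0 and y.get(sp, 0) > 0}
    u = sum(x[sp] for sp in shared) / sum(x.values())
    v = sum(y[sp] for sp in shared) / sum(y.values())
    return u, v


def abundance_indices(x: Counts, y: Counts) -> Tuple[float, float]:
    """Unadjusted abundance-based (Jaccard-type, Sorensen-type) indices.
    No correction for unseen shared species is applied here."""
    u, v = shared_abundance(x, y)
    if u == 0 or v == 0:
        return 0.0, 0.0
    return u * v / (u + v - u * v), 2 * u * v / (u + v)


# Hypothetical toy samples: two shared species ("a", "b"), one unique to each.
sample_1 = {"a": 5, "b": 3, "c": 2}
sample_2 = {"a": 1, "b": 1, "d": 8}

print(classic_indices(sample_1, sample_2))
print(shared_abundance(sample_1, sample_2))
print(abundance_indices(sample_1, sample_2))
```

In this toy case the classic Jaccard index is 0.5 because two of four observed species are shared, whereas the abundance-based Jaccard-type value is about 0.19 because the shared species account for only 20% of the individuals in the second sample. The two families of indices therefore answer different questions, which is the distinction the Discussion draws between matching species lists and matching relative abundances.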