DNA Barcoding the Right Way: A Theory-based Method for Species Detection and Identification. Bill Birky Department of Ecology and Evolutionary Biology The University of Arizona.
Biological Diversity is Discontinuous My goal is to understand a remarkable and general feature of nature: that the diversity of organisms does not present to us as a continuum but as more or less distinct clusters of individuals with different phenotypes that we call species.
Why We Should Care About Species • Species are treated as fundamental units of biological diversity in areas of biology including • systematics • conservation • population genetics • evolutionary biology • biogeography • any research paper where we need to specify the experimental organism(s) • How we define species and distinguish one from another really matters.
Darwin’s Conflicted Views of Species. 1868, The Origin of Species, 5th edition, p. 415: “Hereafter we shall be compelled to acknowledge that the only distinction between species and well-marked varieties is, that the latter are known, or believed to be connected at the present day by intermediate gradations, whereas species were formerly thus connected. Hence, without rejecting the consideration of the present existence of intermediate gradations between any two forms, we shall be led to weigh more carefully and to value higher the actual amount of difference between them.…” At this point Darwin had got it right. In this talk I will follow his advice and weigh more carefully the actual amount of difference between species, relative to the differences within species. If only Darwin had stopped here, but he didn’t…
Fast-forward to 2011… The good news: We now have a proliferation of models of what species are (theoretical/conceptual definitions, often called species concepts) and analytic tools to assign individuals to species (operational definitions, often called species criteria). DNA sequences provide powerful tools for systematics. The bad news: We have a proliferation of models and operational definitions. There is a state of “…‘warfare’ among adherents to different systematic doctrines...and …astonishingly combative language and behavior of some partisans” (Doug Futuyma). Some biologists believe that species aren’t real. Systematics is laissez-faire when it comes to publishing actual species descriptions. Most such papers make no mention of species concepts or operational definitions. My approach to delimiting eukaryotic species…
Darwin: the Gap’s the Thing [Figure: two clusters of similar phenotypes separated by a gap in phenotypes.]
The Gap’s the Thing…But How Big a Gap? [Figure: clusters separated by gaps in phenotypes of different sizes.] This can be addressed using very sophisticated morphometric, physiological, or behavioral analyses, but that is much too time-consuming for routine use… and it is no help with environmental sequences.
Clades in Phylogenetic Trees of DNA Sequences Often Reflect Phenotypic Clusters That We See What we see… phenotypic gap What we infer from sequences… genotypic gap
Species Clusters in Phylogenetic Trees of DNA Sequences Reflect Phenotypic Clusters That We See…But Also Detect Clusters That We Can’t See What we see What we infer from sequences But does this sequence gap separate species, or just varieties within a species?
Causes of Gaps and Clusters in DNA Sequence Trees Accidental variation in the numbers of offspring (random drift) produces transient, shallow gaps and clusters of average depth 2Ne generations. Physical isolation, reproductive isolation, or adaptation to different niches produces deep gaps and clusters of mean depth > 2Ne generations.
The Evolutionary Genetic Species Model This led me to the Evolutionary Genetic Species Model (EGSM): Evolutionary Genetic Species are inclusive populations that can be shown to be evolving independently from each other. They are independent arenas for mutation, selection, and random genetic drift. Their independence can be the result of adaptation to different niches, or physical isolation, or both. [Erratum: or reproductive isolation.] This is a variant of the Evolutionary Species Concept that (1) is explicitly genetic, so we can use it with DNA sequences; (2) does not require that the species be adapted to different niches; and (3) does not require knowing that independence is permanent. Note that it is often difficult to tell whether two populations are evolving independently due to niche divergence or physical isolation.
A Species Criterion or Operational Definition The Evolutionary Genetic Species Model is a conceptual definition; it needs an operational definition or species criterion to say whether two or more individuals belong to one species or to two or more. This can be done in a number of ways. For example: in sexual organisms, using the Biological Species definition and testing individuals for reproductive isolation by trial matings or indirect inference from population genetic data, morphology, etc. I am focusing on DNA sequence data.
We Can Use Genes to Delimit EG Species, but Which Gene(s) Should We Use? Ideal: gene responsible for reproductive isolation or adaptation to different niches or first gene to complete coalescence after physical isolation. Gene responsible for isolation completes lineage sorting when isolation is complete. Usually a nuclear gene(s). Problem: we rarely know what this gene is. Never know with environmental sequences.
What Gene Should We Use? Other nuclear genes: in sexual organisms, different genes sort at different times by chance, ranging from about the time of speciation through an average of 2Ne ≈ 4Nf generations and higher. Second best: an organelle gene (mitochondrial or chloroplast). It is inherited uniparentally (usually maternally in animals and plants) and is effectively haploid. Therefore its effective population size is ≈ Nf. This is 1/4 the effective size for nuclear genes, so it completes lineage sorting 4 times faster. Organelle genes detect speciation earlier.
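As a rough check of the factor of four, here is a sketch that assumes equal numbers of breeding females and males (N_m = N_f, an assumption not stated on the slide) and an effective size close to the number of breeders:

```latex
\begin{align*}
N_e^{\mathrm{nuc}} &\approx N_f + N_m = 2N_f, \\
2N_e^{\mathrm{nuc}} &\approx 4N_f \quad \text{(nuclear gene copies in a diploid population)}, \\
N_e^{\mathrm{org}} &\approx N_f \quad \text{(effectively haploid, maternally inherited)}, \\
\frac{N_e^{\mathrm{org}}}{2N_e^{\mathrm{nuc}}} &\approx \frac{N_f}{4N_f} = \frac{1}{4}.
\end{align*}
```

With an unequal sex ratio the factor would differ from four, so this should be read as an approximation.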
What Gene Should We Use? Organelle genes have other practical advantages: All copies of a gene in a cell or organism are identical at most sites. Consequently one can PCR-amplify an organelle gene and sequence the amplification products directly without cloning. The mitochondrial “barcode” gene (cox1 or CO1) can be amplified from most animals with a universal pair of primers and has an ideal amount of diversity for identifying species.
Gaps: How Deep Is Deep Enough? We have a gene that can detect early stages of speciation. A tree of such a gene should show a gap between species. But how deep must it be to distinguish gaps between species from gaps between clades within species? [Figure: tree with gaps of uncertain status marked “?”.]
Gaps: How Deep is Deep Enough to Differentiate Species? This is the question I will answer by calculating probabilities that the specimens came from independently evolving populations. My favorite cutoff is 95%, so the probability of a single species is ≤ 5%. [Figure: tree with gaps of different depths labeled P = 0.5, P = 0.81, P = 0.98, and P > 0.99.] Note: this is a purely hypothetical case; the probabilities are very rough approximations.
Gaps: How Deep is Deep Enough to Differentiate Species? We need something to compare the between-clade distances to. Solution: compare to within-clade distances. We can get the probabilities from the ratio of the sequence difference between two sister clades (K) to the mean sequence difference within the clades (q). K = f(t, u). Express t in units of Ne generations: K = f(Ne, u) and q = f(Ne, u). Therefore K/q is dimensionless because Ne and u cancel. Good, because Ne and u are usually unknown! p = average pairwise sequence difference within a clade; q = p/(1 − 4p/3); K = average pairwise sequence difference between the two clades, which have been separated for time t.
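The ratio can be sketched in a few lines of Python. This is a minimal illustration rather than the author's software: the function names and the input values (K = 0.08, p = 0.01) are hypothetical, and only the relation q = p/(1 − 4p/3) is taken from the slide.

```python
def q_from_p(p: float) -> float:
    """Within-clade term q from the mean within-clade pairwise difference p,
    using q = p / (1 - 4p/3) as given on the slide."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return p / (1.0 - 4.0 * p / 3.0)


def k_over_q(k: float, p: float) -> float:
    """Ratio of the between-clade difference K to the within-clade term q."""
    q = q_from_p(p)
    if q == 0.0:
        raise ValueError("q is zero, so the ratio is undefined")
    return k / q


# Hypothetical inputs chosen only to illustrate the arithmetic.
ratio = k_over_q(k=0.08, p=0.01)
print(round(ratio, 2))
```

The result is a plain number with no units, which is why neither Ne nor u has to be known.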
One More Problem: Sampling We do not see the tree I showed you earlier (A) because most lineages are extinct. Tree B is the phylogeny of the surviving individuals. But we don’t even see all of the survivors. We make a tree (C) based on a very small sample of individuals from an immense population. A B C
The problem is to use this to infer this and then define species. We have to distinguish between gaps and clusters formed by random drift, and gaps and clusters formed by physical isolation or adaptation to different niches or reproductive isolation. And we must do this based on very small samples of very large populations of the few survivors of evolution.
Fortunately, Noah Rosenberg showed how one can calculate the probability that two populations are reciprocally monophyletic (and therefore have been evolving independently), given that the samples are reciprocally monophyletic and we know the ratio K/q. (Rosenberg 2003 Evolution 57:1465)
Conceptual and Operational Definitions of Species Now we have a conceptual definition or model of species, the EGSM, and an operational definition or species criterion using K/q. In fact K/q together with the sample sizes tells us the probability that a sample includes specimens from two species. Briefly: Make a bootstrapped distance tree of DNA sequences from the specimens to identify robust clades. Get the pairwise sequence differences between the specimens. Starting at the tips of the tree, find pairs of well-supported sister clades and for each pair calculate K/q. Use Noah Rosenberg’s table with K/q and the sample sizes to get the probability that the samples came from independently evolving populations, i.e. from different species. Going toward the root of the tree, repeat until species are found.
First Applied K/q to Delimit Species in Asexual Organisms Birky et al. 2010 PLoS One 5(5):1-11 Oribatid mites Nothrus, Platynothrus Birky and Barraclough 2010 Bdelloid rotifers Birky et al. 2005 Birky and Barraclough 2010 Oligochaete Lumbriculus variegatus Fungus Penicillium Heterotrophic marine flagellates Green alga Ostreococcus
Some of Mike Robeson’s Soil Bdelloids: 3 Cases Involving Singlets K/q = 7.3, n1, n2 = 8, 1, P > 0.98. K/q = 3.97, n1, n2 = 3, 1, P = 0.94. K/q = 2.6, n1, n2 = 21, 1, P ≈ 0.84.
I published a paper with Tim Barraclough and Austin Burt showing that asexual organisms can undergo speciation, without using the word “species”. Only later did I discover that Austin wasn’t sure that species are real. I just realized that this might have some advantages… If species aren’t real, then they can’t go extinct. We don’t need the Endangered Species Act.
Another Ancient Asexual Organism: Darwinulid ostracods (Schön, Pinto, Halse, Martens, Birky, in preparation). First Application to Delimit Species in Sexual Organisms: Copepod Hemidiaptomus (Federico Marrone et al. 2010).
Applying K/q Method To Sexual Organisms We require data in which cox1 or another organelle gene has been amplified from a sample of individuals, sequenced in both directions to minimize sequencing errors, and sequences trimmed to same length to avoid comparing apples and oranges.
Example 1: Pteropod (Sea Butterfly) Limacina helicina (Hunt et al. 2010 Poles apart: the “Bipolar” pteropod species Limacina helicina is genetically distinct between the Arctic and Antarctic Oceans. PLoS ONE 5:e9835.) The phylogenetic tree of cox1 sequences shows that north and south circumpolar populations form well-supported clades. Hunt et al. proposed that these represented different species. I verified this, using K/q to show that these are different evolutionary genetic species.
Implementation of K/q Ratio Test 1. Align and proofread sequences, trim to same length, remove gaps, etc. 2. Make Neighbor-joining (NJ) and bootstrapped NJ trees to identify pairs of sister clades with robust support which are candidates for EG species. 3. Make matrix of pairwise sequence differences and calculate K/q for candidates. Or better, get some or all of this information from other people.
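The third step can be sketched as follows for one pair of candidate sister clades, assuming the sequences are already aligned and trimmed. Everything here is illustrative: the 20 bp toy sequences and function names are made up, uncorrected differences are used for K, and taking the larger within-clade difference is an assumption modeled on the conservative choice in the raven example.

```python
from itertools import combinations, product


def p_distance(a: str, b: str) -> float:
    """Proportion of sites that differ between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned and trimmed to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean_within(clade: list[str]) -> float:
    """Mean pairwise difference among the sequences of one clade."""
    pairs = list(combinations(clade, 2))
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


def mean_between(clade_a: list[str], clade_b: list[str]) -> float:
    """Mean pairwise difference between sequences of two clades (K)."""
    pairs = list(product(clade_a, clade_b))
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


# Toy alignment, invented for illustration only.
clade_a = ["ACGTACGTACGTACGTACGT", "ACGTACGTACGTACGTACGA", "ACGTACGTACGTACGTACGT"]
clade_b = ["TTGTACCTACGAACGTTCGT", "TTGCACCTACGAACGTTCGT"]

k = mean_between(clade_a, clade_b)
# Conservative assumption: use the more diverse of the two clades.
p = max(mean_within(clade_a), mean_within(clade_b))
q = p / (1.0 - 4.0 * p / 3.0)
print(f"K = {k:.4f}, q = {q:.4f}, K/q = {k / q:.2f}")
```

Real data sets would be far longer than 20 bp, and the clades themselves would come from the bootstrapped NJ tree rather than being typed in by hand.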
Implementation of K/q Ratio Test (cont.) 4. In Noah Rosenberg’s table, look up K/q (TA or TB; only goes as high as 5 in the table) and sample sizes (rA, rB; here, 6, 5) and read the probability that the populations from which the samples came are reciprocally monophyletic and evolving independently: P > 0.991675. Important caveat: The probability assumes that the samples are representative of the entire population. This can be tested, for example, by showing that increasing the number and variety of sample locations doesn’t change the conclusions.
Example 2: Ravens Chihuahuan Raven Corvus cryptoleucus Common Raven Corvus corax
Ravens (cont.) Omland et al. (2000 Proc. R. Soc. Lond. B 267:2475; 2006 Molec. Ecol. 15:795): mitochondrial and nuclear DNA sequences show three clades: Chihuahuan Ravens; Common Ravens from Europe, Asia, and most of the U.S.; and most Common Ravens from the Pacific Coast. Pacific Coast ravens
Ravens (cont.) I downloaded all 101 sequences of the raven mitochondrial cob gene from GenBank, plus outgroups. Same procedure as with pteropods: Sequences were aligned (one sequence was deleted because it could not be aligned). Trimmed sequences to 258 bp consisting of 76 complete codons (except one was missing 1 bp at 5’ end and one was missing 1 bp at 3’ end). Made Neighbor-joining trees with and without bootstrapping to identify sister clades. Calculated all pairwise sequence differences in PAUP*. All ingroup sequence differences were ≤ 0.06, so I made no corrections for multiple hits.
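To see why small differences need little correction, here is a sketch using the Jukes-Cantor formula. The choice of Jukes-Cantor is an assumption made for illustration; the slide does not say which multiple-hit correction would have been applied.

```python
import math


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance for an observed proportion of differing sites p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = 0.06  # the largest ingroup difference reported on the slide
d = jukes_cantor(p)
print(f"observed {p:.4f}, corrected {d:.4f}, increase {100 * (d - p) / p:.1f}%")
```

At the largest observed difference the correction adds only a few percent, which is small next to the K/q ratios reported for these clades.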
Ravens (cont.) Results verify three species. Common Raven-California vs. Chihuahuan Raven: using q from Chihuahuan, K/q = 2.34, n1, n2 = 17, 7, P = 0.93 (conservative); using q from Common-California, K/q = 15.0, n1, n2 = 17, 7, P > 0.995. Common Raven-California vs. Common Raven-Holarctic: K/q = 32.6, n1, n2 = 75, 17, P > 0.995. [Figure: tree with clades labeled Chihuahuan Raven, Common Raven-California, and Common Raven-Holarctic.]
Example 3: Liverwort Frullania tamarisci (Scalewort) Jochen Heinrichs et al. 2010 One species or at least eight? Delimitation and distribution of Frullania tamarisci (L.) Dumort s. l. (Jungermanniopsida, Porellales) inferred from nuclear and chloroplast DNA markers. Mol. Phylogenet. Evol. 56:1105-1114. I obtained the sequences from Jochen Heinrichs and edited them: 1. Deleted taxa except for the clade identified as Frullania tamarisci sensu lato by Heinrichs et al. 2. Removed nuclear genes, leaving concatenated chloroplast genes trnL-F + atpB-rbcL. 3. Trimmed these to ca. same length and removed most gaps. 4. Made Neighbor-joining tree and bootstrapped NJ tree to identify well-supported clades.
Liverwort (cont.) I used K/q to verify Heinrichs’ conclusion that F. tamarisci is a complex of species, and to show that two singlets and their sister clades are probably samples from different species.
Example 4: Clouded Leopard Kitchener et al., 2006 Four subspecies are actually two species (grey and black) based on phenotypes.
Clouded Leopard (cont.) Buckley-Beason et al. 2006: NJ K2P tree of 771 bp of mtDNA verifies species based on reciprocal monophyly and deep divergences. By inspection, K/q ≥ 4 and P(2 species) ≥ 0.95.
Marine Enchytraeid Oligochaete Grania In the mitochondrial cox1 tree the established species formed well-supported clades separated from each other by deep gaps, judged by the authors to show absence of gene flow “in a long time” despite some of the species being sympatric. E.g. one specimen was judged to be well separated from its sister clade and, despite being morphologically identical to G. postclitellochaeta, was described as a new species, G. occulta. Examination of the cox1 tree showed that these clades have a sufficiently large K/q ratio to easily qualify as EG species. PDW15 vs. other G. postclitellochaeta may also be distinct species (open circle). De Wit & Erséus 2010 “Genetic variation and phylogeny of Scandinavian species of Grania (Annelida: Clitellata: Enchytraeidae), with the discovery of a cryptic species.” J. Zool. Syst. Evol. Res. 48:285
Grania (cont.) De Wit & Erséus 2010 J. Zool. Syst. Evol. Res. 48:285 Previously described species verified by K/q. New species, verified by K/q. Other K/q species? The K/q ratio should be used to determine the probability that the yellow starred specimens represent new species. The authors didn’t consider these for species status because the nuclear ITS sequence didn’t separate them from their sister clade, but it’s not surprising that nuclear genes would segregate later.
Potential Problems/Limitations of K/q Method 1. Problem of female philopatry, noted by Weisrock et al. (2010) for lemurs: The use of mitochondrial or chloroplast genes will be misleading if two populations have no female migration, but male migration continues. Then the two populations will be assigned to different species by the K/q ratio of mito genes but males will carry nuclear genes between the populations and prevent independent evolution. When this is suspected, it might be appropriate to use both an organelle gene and a nuclear gene to track males. 2. Because coalescence is a stochastic process, a small proportion of nuclear genes are expected to achieve reciprocal monophyly before organelle genes. Unfortunately it is impossible to identify those genes in advance, and it would be very difficult to identify them after the fact.
Potential Problems/Limitations of K/q Method (cont.) 3. It bears repeating that the probability assumes that the samples are representative of the entire population. This can be tested as I did for the bdelloid rotifers, by showing that increasing the number of sample locations, the number of samples per site, and the number of individuals in the sample doesn’t change the conclusions. Increasing the sample coverage did not split or lump species found with smaller samples. But when K/q is large or q is in the usual range for the group of organisms, it is unlikely that additional sampling will increase q enough to reduce the ratio significantly.
Barcode Gap As two populations diverge, a frequency distribution of the pairwise differences among their sequences becomes bimodal: one peak for differences within species, the other for differences between species. The gap between the peaks is sometimes called the “barcode gap”. Sea butterfly example: [Figure: histogram of the number of sequence pairs against percent pairwise sequence difference, with the barcode gap between the two peaks.]
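A histogram of this kind can be sketched by binning pairwise differences into 1% classes and listing the empty classes between the two peaks. The values below are invented for illustration (they are not the sea butterfly data), and the function name is hypothetical.

```python
import math
from collections import Counter


def bin_counts(differences_percent: list[float], bin_width: float = 1.0) -> dict[int, int]:
    """Count pairwise differences per bin; bin i covers [i*w, (i+1)*w) percent."""
    counts = Counter(math.floor(d / bin_width) for d in differences_percent)
    return dict(sorted(counts.items()))


# Hypothetical percent differences: a within-species peak and a between-species peak.
differences = [0.0, 0.2, 0.5, 0.9, 33.1, 34.0, 35.2]
counts = bin_counts(differences)

# Empty bins between the lowest and highest occupied bins form the gap.
occupied = list(counts)
empty = [b for b in range(occupied[0], occupied[-1] + 1) if b not in counts]
print(counts)
print(f"empty bins from {empty[0]}% to {empty[-1] + 1}%")
```

With only two populations the empty stretch is easy to spot; the harder question of how wide it must be is what the K/q ratio addresses.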
Barcode Gap (cont.) The barcode gap is used by the Consortium for the Barcode of Life (CBOL) and the International Barcode of Life project (iBOL) to identify gaps between sequences from already-described species. [Figure: number of pairs plotted against sequence difference.]
Barcode Gap (cont.) 1. Problem: assumes species defined by systematists are real species. So systematists are the only people who never make misteaks? 2. Critics of barcoding point to cases where gap fails to distinguish species, or splits a species, as failures of barcoding. But when data from more than two species are pooled, the gap can disappear if the different species pairs have different diversities. Testing barcoding by looking for a gap in data pooled from many species sets it up for failure.
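The pooling effect is easy to reproduce with made-up numbers. In this sketch both species pairs have a clear gap on their own, but because the second pair is more diverse, its within-species differences overlap the between-species differences of the first pair once the data are pooled. All values and names are hypothetical.

```python
def has_gap(within: list[float], between: list[float]) -> bool:
    """True if every between-species difference exceeds every within-species one."""
    return min(between) > max(within)


# Hypothetical percent differences for two species pairs with different diversities.
pair_1 = {"within": [0.1, 0.4, 0.8], "between": [3.0, 3.5]}
pair_2 = {"within": [2.0, 3.2, 4.1], "between": [12.0, 13.5]}

pooled_within = pair_1["within"] + pair_2["within"]
pooled_between = pair_1["between"] + pair_2["between"]

print("pair 1 gap:", has_gap(pair_1["within"], pair_1["between"]))
print("pair 2 gap:", has_gap(pair_2["within"], pair_2["between"]))
print("pooled gap:", has_gap(pooled_within, pooled_between))
```

Each pair passes on its own and the pooled data fail, so a missing gap in pooled data says little about whether the gene separates any particular pair of species.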
Barcode Gap (cont.) 3. As practiced by CBOL/iBOL, barcoding has no theoretical rationale. Using the evolutionary species concept, or my version of it, and the K/q ratio to delimit species would solve these problems.
The K/q Ratio Is Not Exclusive • Use of the K/q ratio does not preclude the use of other methods to test whether a sample includes specimens from ≥ 2 evolutionary genetic species. For example: • If one could show that the specimens fell into groups that could mate only with members of the same group, this is evidence that the sample includes members of two different species even if they are sympatric. • If individuals in a sample came from one or the other of two well-separated geographic locations and there was no migration between them, this is evidence that the populations in those regions would be evolving independently and so are different species. • Note that the sampling problem still exists…statistical analysis is needed! • Finding species by using DNA sequences is not the end of taxonomy! Whenever it is practical, species found in this way should be studied to find morphological traits that distinguish them reliably. Just as in traditional systematics, the behavior, ecology, and distribution of the species should be studied.