Charles Darwin and Alfred Russel Wallace
Evolution as descent with modification, implying relationships between organisms by unbroken genetic lines. Phylogenetics seeks to determine these genetic relationships.
[Figures: Alfred Russel Wallace; Charles Darwin; Darwin’s sketch: the first phylogenetic tree?]
Interpretation of morphological characters is often subjective, so open to personal biases
E.g. jaw rotation: weak (0), moderate (1), strong (2), as indicated by vertical wear facets on molars
Character states: Cynodonts (0), Morganucodonts (1), Eutriconodonts (1), Spalacotheriids (2), Eupantotheres (2), Archaic therians (2), Modern therians (2)
Hu et al. (Nature, 1997) and Ji et al. (Nature, 1999) coded Steropodon as (1) and (2) respectively, helping to account for their alternative placements of monotremes
[Figure: opalized lower jaw of the monotreme Steropodon]
Early molecular phylogenetics • Immunological distances • DNA-DNA hybridization • Without access to the actual sequences, it is difficult to apply corrections and statistical significance testing to these measures
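Phylogenetics is now dominated by the clearly defined 4 nucleotides and 20 amino acids
Purines: A, G. Pyrimidines: C, T
Transitions: changes within a class (A↔G, C↔T). Transversions: changes between a purine and a pyrimidine
[Figure: hominid phylogeny from DNA, with a time axis in millions of years]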
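The purine/pyrimidine distinction above can be sketched in a few lines of Python. This is a minimal illustration, not a method from the slides: the function names are hypothetical, and the input is assumed to contain only unambiguous A, C, G and T of equal length.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a, b):
    """Classify one aligned pair of bases (assumes only A, C, G, T)."""
    a, b = a.upper(), b.upper()
    if a == b:
        return "identical"
    pair = {a, b}
    # A transition stays within one chemical class; a transversion crosses classes.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def count_ti_tv(seq1, seq2):
    """Count observed transitions and transversions between two aligned sequences."""
    counts = {"transition": 0, "transversion": 0}
    for a, b in zip(seq1, seq2):
        kind = classify_substitution(a, b)
        if kind in counts:
            counts[kind] += 1
    return counts


if __name__ == "__main__":
    print(count_ti_tv("ACATGTG", "ACGTCAG"))
```

These are raw observed counts; a model-based ti/tv ratio would also correct for multiple substitutions at the same site.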
Tree terminology: rooted tree, unrooted tree, node, internode, internal edge/branch, external edge/branch [Figure: example trees labelled with these terms; tips Taxon 1 to Taxon 8]
Further tree terminology: outgroup, ingroup, sister taxa, bifurcating, polytomy, paraphyly, polyphyly
Overview of phylogenetic procedure - by example
1. Biological problem (the question)
2. Which data to obtain (data sampling)
3. Finding the best tree (search strategy)
4. Defining the best tree (optimality criterion)
1. Biological problem (the question) What is the relationship of the extinct American Cheetah (Miracinonyx trumani) to other cats? Two main sister group hypotheses: A. Cheetahs (Acinonyx jubatus): limb, skull, vertebrae morphology. B. Pumas (Felis concolor): geography, early fossils less cheetah-like. See Barnett et al. (Curr. Biol., 2005)
2. Which data to obtain (data sampling) • Mitochondrial (mt) DNA • High mtDNA copy number is important because ancient DNA is degraded • Inferring relatively recent (2-10 million year) divergences, so substantial sequence variation is required • mt control region: best < 2 million years • mt protein/RNA coding: best 2-25 million years • Nuclear protein-coding: best > 25 million years [Figure: observed divergence plotted against time for each marker]
Mitochondrial partial NADH1 alignment for birds
#Nexus
Begin DATA;
Dimensions ntax=29 nchar=10692;
Format datatype=dna gap=-;
Matrix
Tinamou       AACTATCTATTCATATCCTTATCATACATCATTCCTATTCTTATTGCA..
Emu           AACCATCTCACTATATCACTCTCCTATGCAATCCCCATTCTAATCGCA..
Cassowary     AACCACCTCACCATATCCCTGTCCTATGCAATCCCAATTCTAATCGCA..
Kiwi          AACTACCTCACTATATCACTATCATATGTCATCCCAATTCTGATTGCA..
Rhea          AACTACCTAATTATGTCCCTGTCATATGCTATCCCAATTCTAATCGCA..
Ostrich       ACACACCTGACTATAGCACTCTCATACGCTGTTCCAATCCTAATTGCA..
Chicken       AACCTTCTAATCATAACCTTATCCTATATTCTCCCCATCCTAATCGCC..
BrushTurkey   AAACACCTCATCATATCCCTATCCTATGTTCTCCCAATTTTAATCGCC..
MagpieGoose   AATCACCTCATTATAACCCTATCGTATGCCATCCCAATCCTAATCGCC..
Duck          AGCTACCTCATTATATCCCTCCTATACGCCATCCCCATTCTAATCGCC..
Broadbill     ACTAACCTTACCATATCCCTATCCTACGCCATCCCCGTCCTAGTTGCC..
Flycatcher    ACCCACCTCATTATATCACTATCCTATGCCGTACCCATCCTAATTGCT..
ZebraFinch    ATTAACCTCATCATAGCCCTCTCCTATGCCCTCCCAATCCTGATCGCA..
Rook          GTCAACCTCATTATAGCACTTTCTTATGCTATCCCTATTCTAATCGCC..
Oystercatcher ACCTATCTCATTATATCCCTATCCTATGCCATCCCAATCCTGATCGCA..
Turnstone     ACCTACTTCATCATATCCCTATCCTATGCAATCCCAATTCTAATTGCA..
Penguin       GCTCACTTAGCCATATCCCTATCCTATGCCATCCCAATCCTCATTGCA..
Albatross     ACCTATCTTGTCATGTCCCTATCATATGCCATCCCAATCCTAATCGCC..
;
End;
Tree reconstruction
Type of data: distances, or discrete (e.g. nucleotides). Distances: information loss, often statistical power loss
Tree-building method: clustering algorithm (faster), or optimality criterion (slower)
• Clustering algorithm, distances: unweighted pair group method with arithmetic means (UPGMA), neighbour-joining (NJ)
• Optimality criterion, distances: minimum evolution (ME)
• Optimality criterion, discrete: maximum parsimony (MP), maximum likelihood (ML)
3. Finding the best tree (search strategy)
Number of possible trees (where n is the number of taxa)
Unrooted trees: (2n-5) × (2n-7) × … × 3 × 1
Rooted trees: (2n-3) × (2n-5) × … × 3 × 1
For the 11-taxon cat phylogeny:
Unrooted = 17 × 15 × 13 × 11 × 9 × 7 × 5 × 3 × 1 = 34,459,425
Rooted = Unrooted × (2n-3) = 654,729,075
An exhaustive search will examine all trees, but is not practical for n > 12
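The tree counts above can be checked with a short Python sketch. The function names are illustrative only; the formulas are the products of odd numbers given on this slide for strictly bifurcating trees.

```python
def count_unrooted_trees(n):
    """Number of unrooted bifurcating trees for n taxa: (2n-5) x (2n-7) x ... x 3 x 1."""
    if n < 3:
        raise ValueError("need at least 3 taxa for an unrooted tree")
    total = 1
    for k in range(3, 2 * n - 5 + 1, 2):  # odd numbers 3, 5, ..., 2n-5
        total *= k
    return total


def count_rooted_trees(n):
    """Number of rooted bifurcating trees: the unrooted count times (2n-3)."""
    return count_unrooted_trees(n) * (2 * n - 3)


if __name__ == "__main__":
    for n in (4, 8, 11, 12, 20):
        print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

Printing a few values of n shows how quickly the count grows, which is why exhaustive search stops being practical at around a dozen taxa.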
Reducing the time for searching “tree space” Heuristic search: find an initial tree and move within nearby tree-space, discarding worse alternatives. Only a small amount of tree-space is searched and there is no guarantee of finding the optimal tree: the search can be trapped in local optima [Figure: tree-space landscape marking the starting point, local optima and the global optimum]
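The trapping behaviour can be illustrated with a toy hill-climbing sketch in Python. This is not a tree search: the "landscape" is a hypothetical list of scores (higher is better) standing in for tree-space, and the only move is to an adjacent position, where a real program would use tree rearrangements.

```python
def hill_climb(scores, start):
    """Greedy search: keep moving to the better neighbour until none improves."""
    position = start
    while True:
        neighbours = [i for i in (position - 1, position + 1) if 0 <= i < len(scores)]
        if not neighbours:
            return position
        best = max(neighbours, key=lambda i: scores[i])
        if scores[best] <= scores[position]:
            return position  # no neighbour is better: a local (perhaps global) optimum
        position = best


if __name__ == "__main__":
    # Hypothetical landscape: local optimum at index 2, global optimum at index 6.
    landscape = [1, 3, 5, 4, 2, 6, 9, 7]
    print(hill_climb(landscape, start=0))  # stops at the local optimum
    print(hill_climb(landscape, start=4))  # reaches the global optimum
```

The result depends on the starting point, which is one reason heuristic searches are commonly repeated from several different starting trees.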
Branch and Bound search: as trees are built and branches added, if the addition of a taxon to a particular branch results in a tree-length greater than a previously determined upper bound for the tree, then this topology and all those derived from it are ignored and the search continues with a new placement for that taxon. Branch and bound guarantees finding globally optimal trees [Figure: tree-space landscape marking the starting point, local optima and the global optimum]
4. Defining the best tree (optimality criteria)
Distance methods
Absolute distance matrix
                  1    2    3    4    5    6    7    8    9   10   11
 1 Mongoose       -
 2 Hyena        156    -
 3 Sabretooth   207  147    -
 4 Am.Cheetah   192  140  159    -
 5 Lion         186  134  148  131    -
 6 Tiger        160  143  132  111   64    -
 7 Puma         194  139  162   70  124  100    -
 8 House.Cat    206  133  163  124  118  100  117    -
 9 Cheetah      192  139  162  108  127  109   96  110    -
10 Ocelot       206  123  165  116  116   98  111   98  113    -
11 Jaguarundi   204  147  177  123  143  121  101  119  128  131    -
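As a minimal sketch of how such a matrix is read, the Python below stores the lower triangle from this slide and looks up the closest pair overall and the nearest neighbour of one taxon. The helper names are hypothetical, and this is only a lookup, not a tree-building method such as NJ or ME.

```python
NAMES = ["Mongoose", "Hyena", "Sabretooth", "Am.Cheetah", "Lion", "Tiger",
         "Puma", "House.Cat", "Cheetah", "Ocelot", "Jaguarundi"]

# Lower triangle of the absolute distance matrix, one row per taxon.
LOWER = [
    [],
    [156],
    [207, 147],
    [192, 140, 159],
    [186, 134, 148, 131],
    [160, 143, 132, 111, 64],
    [194, 139, 162, 70, 124, 100],
    [206, 133, 163, 124, 118, 100, 117],
    [192, 139, 162, 108, 127, 109, 96, 110],
    [206, 123, 165, 116, 116, 98, 111, 98, 113],
    [204, 147, 177, 123, 143, 121, 101, 119, 128, 131],
]


def distance(i, j):
    """Symmetric lookup into the lower-triangular matrix."""
    if i == j:
        return 0
    if i < j:
        i, j = j, i
    return LOWER[i][j]


def closest_pair():
    """Return (distance, taxon, taxon) for the smallest entry in the matrix."""
    return min((LOWER[i][j], NAMES[j], NAMES[i])
               for i in range(len(NAMES)) for j in range(i))


def nearest_neighbour(name):
    """Return (distance, taxon) for the taxon with the smallest distance to `name`."""
    i = NAMES.index(name)
    return min((distance(i, j), NAMES[j]) for j in range(len(NAMES)) if j != i)


if __name__ == "__main__":
    print(closest_pair())
    print(nearest_neighbour("Am.Cheetah"))
```

In this matrix the American cheetah is 70 differences from the puma and 108 from the cheetah, although raw similarity alone does not establish relationship.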
Early phenetics (distance/similarity) studies would note that taxon X and taxon Z are the most similar
Taxon Y TCAGCTA
Taxon X ACATGTG
Taxon Z ACGTCAG
XZ = 3 differences, XY = 4 differences, YZ = 5 differences
[Figure: phenetic tree grouping Taxon X with Taxon Z, then Taxon Y]
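The pairwise counts on this slide can be reproduced with a short Python sketch. The function names are illustrative; the sequences are the three from the slide and are assumed to be already aligned.

```python
from itertools import combinations

SEQS = {"Y": "TCAGCTA", "X": "ACATGTG", "Z": "ACGTCAG"}


def count_differences(a, b):
    """Number of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_differences(seqs):
    """Differences for every pair of taxa, keyed by the (sorted) pair of names."""
    return {(p, q): count_differences(seqs[p], seqs[q])
            for p, q in combinations(sorted(seqs), 2)}


if __name__ == "__main__":
    for pair, d in pairwise_differences(SEQS).items():
        print(pair, d)
```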
Cladistic methods, rather than being concerned with similarity, are concerned with the nature of changes (apomorphies)
Taxon Y  TC A GCTA
Taxon X  AC A TGTG
Taxon Z  AC G TCAG
Outgroup AA G TCTG
[Figure labels: synapomorphy, autapomorphy, symplesiomorphy]
Synapomorphies are shared derived characters and so are considered to define clades (relationship groupings)
Maximum Parsimony: chooses the tree topology that minimises the number of changes required
[Figure: two trees of Taxon X, Taxon Y, Taxon Z and the Outgroup, with changes marked *]
Phenetic tree (Taxon X with Taxon Z): 8 steps, sub-optimal; character 3 is a homoplasy
MP tree (Taxon X with Taxon Y): 7 steps; character 3 changes G to A once, a synapomorphy
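The 7-step and 8-step tree lengths can be checked with Fitch's small-parsimony algorithm, sketched below in Python. The slides do not say how the steps were counted, so Fitch's algorithm is an assumption here. Trees are written as nested tuples of taxon names, all changes are weighted equally, and the function names are illustrative.

```python
SEQS = {
    "Taxon Y": "TCAGCTA",
    "Taxon X": "ACATGTG",
    "Taxon Z": "ACGTCAG",
    "Outgroup": "AAGTCTG",
}


def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one site."""
    if isinstance(tree, str):  # a tip: its state is observed
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_site(left, states)
    right_set, right_changes = fitch_site(right, states)
    shared = left_set & right_set
    if shared:  # children agree on at least one state: no extra change
        return shared, left_changes + right_changes
    return left_set | right_set, left_changes + right_changes + 1


def tree_length(tree, seqs):
    """Total number of changes required by a tree, summed over all sites."""
    n_sites = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {taxon: seq[i] for taxon, seq in seqs.items()})[1]
        for i in range(n_sites)
    )


MP_TREE = ((("Taxon X", "Taxon Y"), "Taxon Z"), "Outgroup")
PHENETIC_TREE = ((("Taxon X", "Taxon Z"), "Taxon Y"), "Outgroup")

if __name__ == "__main__":
    print(tree_length(MP_TREE, SEQS), tree_length(PHENETIC_TREE, SEQS))
```

The two trees differ only at character 3, which needs one change when X and Y are grouped and two when X and Z are grouped.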
Maximum Likelihood: The explanation that makes the observed outcome the most likely L = Pr(D|H) Probability of the data, given an hypothesis The hypothesis is a tree topology, its branch-lengths and a model under which the data evolved First use in phylogenetics: Cavalli-Sforza and Edwards (1967) for gene frequency data; Felsenstein (1981) for DNA sequences
Model of rate change, e.g. Hasegawa-Kishino-Yano (HKY85): 4 base frequencies, transition/transversion (ti/tv) ratio
[Figure: a tree with branch lengths of 0.5, 0.5, 0.6, 0.4 and 0.4 substitutions per site, redrawn for each assignment of A, C, G or T to the two internal nodes]
Sum the probabilities for each of the 16 internal node combinations to get the likelihood for this single nucleotide site
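The sum over the 16 internal-node combinations can be sketched in Python. Several things here are assumptions made for illustration: the simpler Jukes-Cantor (JC69) model is swapped in for the model named on the slide, the tree is taken to have four tips (a, b joined at one internal node, c, d at the other), and the five branch lengths from the figure are assigned to branches arbitrarily.

```python
import math
from itertools import product

BASES = "ACGT"


def jc69_prob(i, j, t):
    """JC69 probability that base i becomes base j along a branch of length t
    (t in expected substitutions per site)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay


def site_likelihood(tips, branch_lengths):
    """Likelihood of one site on the tree ((a,b),(c,d)).

    tips: the four observed bases (a, b, c, d).
    branch_lengths: (t_a, t_b, t_c, t_d, t_internal); assignment is hypothetical.
    """
    a, b, c, d = tips
    t_a, t_b, t_c, t_d, t_int = branch_lengths
    total = 0.0
    for u, v in product(BASES, repeat=2):  # the 16 internal-node combinations
        total += (0.25                      # equal base frequencies under JC69
                  * jc69_prob(u, a, t_a) * jc69_prob(u, b, t_b)
                  * jc69_prob(u, v, t_int)
                  * jc69_prob(v, c, t_c) * jc69_prob(v, d, t_d))
    return total


if __name__ == "__main__":
    lengths = (0.5, 0.5, 0.4, 0.4, 0.6)
    print(site_likelihood("AAGG", lengths))
```

Because each site likelihood is a probability of one tip pattern, the likelihoods of all 256 possible four-tip patterns sum to 1, which is a useful sanity check.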
The likelihood of a tree is the product of the site likelihoods. Taken as natural logs, the site likelihoods can be summed to give the log likelihood: lnL = lnL(1) + lnL(2) + … + lnL(N) for N sites. The tree with the highest lnL (equivalently, the lowest -lnL) is the ML tree • ML is computationally intensive (slow) • If branch-lengths are long, such that substitutions occur multiple times along the same branch for the same site, ML will be more consistent than MP, provided the evolutionary process is sufficiently well modelled.
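A small Python sketch shows why the logs are summed rather than the likelihoods multiplied: the product of thousands of small probabilities underflows to zero in floating point. All numbers below are hypothetical, and the ML tree is the one with the highest lnL (the value closest to zero).

```python
import math


def log_likelihood(site_likelihoods):
    """Sum of natural logs of the per-site likelihoods."""
    return sum(math.log(p) for p in site_likelihoods)


if __name__ == "__main__":
    sites = [0.01] * 2000          # hypothetical per-site likelihoods
    print(math.prod(sites))        # the raw product underflows to 0.0
    print(log_likelihood(sites))   # the log likelihood is still usable

    # Hypothetical log likelihoods for two candidate trees.
    candidates = {"tree A": -1234.5, "tree B": -1229.8}
    print(max(candidates, key=candidates.get))
```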
Bayesian Inference: The explanation with the highest posterior probability
Bayes’ Theorem: Pr(H|D) = Pr(H) Pr(D|H) / Pr(D)
Pr(H): prior probability, the probability of the hypothesis on previous knowledge
Pr(D|H): likelihood function, probability of the data given the hypothesis
Pr(H|D): posterior probability, the probability of the hypothesis given the data
Pr(D): unconditional probability of the data, a normalizing constant ensuring the posterior probabilities sum to 1.00
First use in phylogenetics: Li (1996, PhD thesis), Rannala and Yang (1996)
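Bayes’ theorem can be worked through for a handful of competing hypotheses with a short Python sketch. The three trees, the flat prior and the likelihood values are all hypothetical; in a real analysis Pr(D) cannot be summed over all trees like this.

```python
def posterior(priors, likelihoods):
    """Posterior Pr(H|D) for each hypothesis from Pr(H) and Pr(D|H)."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(joint.values())  # Pr(D), the normalizing constant
    return {h: joint[h] / evidence for h in joint}


if __name__ == "__main__":
    priors = {"tree 1": 1 / 3, "tree 2": 1 / 3, "tree 3": 1 / 3}       # flat prior
    likelihoods = {"tree 1": 0.008, "tree 2": 0.003, "tree 3": 0.001}  # hypothetical
    for tree, prob in posterior(priors, likelihoods).items():
        print(tree, round(prob, 3))
```

With a flat prior the posterior is simply the likelihood rescaled to sum to 1; an informative prior would shift it.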
Bayesian inference in phylogenetics is essentially a likelihood method, but may more closely reflect the way humans think. • It is informed by prior knowledge (e.g. fossil data) • Emphasis is placed on Pr(H|D) instead of Pr(D|H)
Markov chain Monte Carlo (MCMC) is used to approximate Bayesian posterior probabilities (BPP) over 1,000s to 1,000,000s of generations
[Figure: a chain of 6 generations moving among Tree 1, Tree 2 and Tree 3, with each proposed new state accepted or rejected; Tree 1 is sampled in 4 of the 6 generations, so BPP(tree 1) = 4/6]
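The accept/reject idea can be sketched with a toy Metropolis sampler in Python. This is a simplification under stated assumptions: the chain moves among only three hypothetical trees with made-up unnormalized weights (prior × likelihood), whereas real phylogenetic MCMC also proposes changes to branch lengths and model parameters.

```python
import random
from collections import Counter


def metropolis_chain(weights, generations, seed=0):
    """Sample states in proportion to their (unnormalized) weights."""
    rng = random.Random(seed)
    states = list(weights)
    current = states[0]
    samples = []
    for _ in range(generations):
        # Symmetric proposal: pick one of the other states uniformly.
        proposal = rng.choice([s for s in states if s != current])
        accept_prob = min(1.0, weights[proposal] / weights[current])
        if rng.random() < accept_prob:
            current = proposal      # new state accepted
        samples.append(current)     # if rejected, the current state is counted again
    return samples


def bpp(samples):
    """Estimated posterior probability: fraction of generations spent in each state."""
    counts = Counter(samples)
    return {state: n / len(samples) for state, n in counts.items()}


if __name__ == "__main__":
    weights = {"tree 1": 8.0, "tree 2": 3.0, "tree 3": 1.0}  # hypothetical
    print(bpp(metropolis_chain(weights, generations=50000)))
```

A chain that visits tree 1 in 4 of 6 generations gives BPP(tree 1) = 4/6, and longer chains bring the estimate closer to the true posterior.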
Posterior probabilities are integrated over all trees in the posterior distribution, providing density distributions rather than the optimization of likelihood [Figure: two density plots on a 0 to 1.0 axis: a flat prior for a parameter value (e.g. proportion of invariant sites), and the posterior for the proportion of invariant sites]
The American cheetah is related to the puma; morphological similarity to the cheetah is convergence
[Figure: two trees of Mongoose, Hyena, Sabretooth, Am.Cheetah, Puma, Jaguarundi, Cheetah, Cat, Ocelot, Lion and Tiger, with the American felids labelled: a maximum parsimony and neighbour-joining (distance) cladogram, and a maximum likelihood and Bayesian inference phylogram with a scale bar of 0.05 substitutions/site]
Applications: The tree of life and inferring our origins
146 gene phylogeny: Delsuc et al. (Nature, 2006) Little evidence from fossils
Identifying selection
ACA GAG CGC: Threonine - Glutamic acid - Arginine
ACG GAG AGC: Threonine - Glutamic acid - Serine
Synonymous (S) and non-synonymous (N) substitutions: ACA to ACG is synonymous, CGC to AGC is non-synonymous
Decreased dN/dS suggests purifying selection; increased dN/dS suggests positive selection
The dN/dS ratio can be estimated along branches of phylogenetic trees (e.g. Guindon et al. PNAS, 2004) [Figure: tree in which dN/dS is indicated by branch width]
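The synonymous/non-synonymous distinction can be sketched in Python using the standard genetic code. The function names are illustrative, and the sketch only labels observed codon differences; it does not estimate dN/dS, which also requires counting synonymous and non-synonymous sites and correcting for multiple substitutions.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def split_codons(seq):
    """Split an in-frame coding sequence into codons."""
    if len(seq) % 3:
        raise ValueError("sequence length must be a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def classify_codon_changes(seq1, seq2):
    """Label each aligned codon pair as unchanged, synonymous or non-synonymous."""
    results = []
    for c1, c2 in zip(split_codons(seq1), split_codons(seq2)):
        if c1 == c2:
            kind = "unchanged"
        elif CODON_TABLE[c1] == CODON_TABLE[c2]:
            kind = "synonymous"
        else:
            kind = "non-synonymous"
        results.append((c1, c2, kind))
    return results


if __name__ == "__main__":
    for row in classify_codon_changes("ACAGAGCGC", "ACGGAGAGC"):
        print(row)
```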
Cohen (Molec. Biol. Evol., 2002) found increased positive selection at binding sites in the MHC proteins of estuarine fish Fundulus heteroclitus populations subject to severe chemical pollution. Non-synonymous/synonymous ratios for peptide binding regions and non-peptide binding regions MHC (Major histocompatibility complex) binds antigens and presents them to T-cells as part of the immune response. Positive selection at binding sites provides high MHC variability with which to confront new pathogenic threats.
Fish from the Hot spot and Gloucester populations are genetically adapted to severe chemical pollution and show novel patterns of DNA substitution for Mhc class II B locus including strong signals of positive selection at inferred antigen-binding sites Mhc class II B with inferred locations of population-specific amino acid changes for Gloucester and Hot Spot.
Stanhope et al. (Infect. Genet. Evol., 2004) Severe Acute Respiratory Syndrome coronavirus (SARS-CoV) has a recombinant history with lineages of types I and III coronavirus
Using more sophisticated models of sequence evolution, Holmes and Rambaut (Phil. Trans. Roy. Soc. B, 2004) could not reject a single history across the SARS genome [Figure: tree of coronavirus groups I, II and III with SARS-TOR2] Understanding sequence evolution, and the biases that may result from models (which necessarily are simplifications), is of vital importance in phylogenetic inference
Host-Parasite coevolution/co-speciation: Etherington et al. (J. Gen. Virol., 2006) Caliciviruses infect diverse mammalian hosts and include Norovirus, the major cause of food-borne viral gastroenteritis in humans. Host switching by caliciviruses is rare, although pigs have strains from co-speciation (artiodactyl strain) and host switching (carnivoran strain). [Figure: tree with carnivoran strains and artiodactyl strains labelled]
Fig (Ficus) and fig wasp mutualism is reflected by co-speciation patterns: Machado et al. (PNAS, 2006)
Most frequent area cladograms: mapping taxa onto landmasses • Many plants: follows wind dispersal patterns • Many land animals: follows continental break-up [Figure: area cladograms for Africa, S. South America, Australia and New Zealand; example taxa are midges, Southern beech, cushion herb and marsupial mammals] From: Sanmartín and Ronquist (Syst. Biol., 2004)
Conservation genetics: Amur leopard (Panthera pardus orientalis) Relict population of 25-40 individuals in the Russian Far East. • Nuclear microsatellites and mtDNA: Uphyrkina et al. (J. Hered., 2002) • validates subspecies distinctiveness • extreme reduction in genetic diversity in the wild • captive population genetically mixed with the Chinese subspecies
Macroevolutionary inference: does the 65 Ma meteor impact (Alvarez et al. Science, 1980) fully explain the “great reptile extinction” and the rise of modern birds and mammals? [Figure: timeline from the Cretaceous through the Tertiary to the present, with the boundary at 65 Ma]
Molecular clock: DNA/protein divergence between organisms is a function of time [Figure: dated tree with the K/T boundary marked; intervals labelled 144-83 Ma, 83-71 Ma, 71-68 Ma and 68-65 Ma; axis labels 95 Ma and 65 Ma]
Megafaunal extinctions (human induced or climate change) [Figures: Macrauchenia; bison (Lascaux, France)]
Arrival of humans in North America: the distribution of coalescence events over time on the tree allows inference of relative population size [Figure: tree and population-size plot with the last glacial maximum marked]