Explore approaches and challenges in large-scale phylogenetic inference, including data availability, MCMC and MCMCMC inference methods, pattern-heterogeneity models, and the accumulation of gene sequence data over the years.
Large Scale Phylogenetic Inference
Mark Pagel and Andrew Meade
Reading University
m.pagel@rdg.ac.uk
Large-Scale Phylogenetic Inference: Approaches and Problems
• Availability of data
• Inference from aligned gene sequences: traversing the universe
• MCMC and MCMCMC inference (assessing the potential for large-scale inference)
• A model of pattern-heterogeneity suitable for concatenated sequences
A Tree of Life, n = 4000 species (David Hills)
The accumulation of gene sequence data
Year    No. of Sequences
1994    215,273
2001    14,976,310
• = 70X growth over 7 years
• Compare: 20% per annum = 3.6X growth over 7 years
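The two growth figures on this slide can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted above; the variable names are illustrative, and the implied annual rate is a derived quantity that does not appear on the slide.

```python
# Sequence counts quoted on the slide (GenBank, 1994 and 2001).
seq_1994 = 215_273
seq_2001 = 14_976_310
years = 2001 - 1994

# Overall growth factor over the 7 years (the slide rounds this to 70X).
growth = seq_2001 / seq_1994

# Constant annual growth factor that would produce the same overall growth.
annual = growth ** (1 / years)

# Comparison case from the slide: 20% per annum compounded over 7 years.
baseline = 1.20 ** years

print(f"observed growth: {growth:.1f}X, implied annual factor: {annual:.2f}")
print(f"20% per annum over {years} years: {baseline:.1f}X")
```

Under these numbers the observed growth corresponds to the database getting close to doubling every year, against which 20% per annum looks modest.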
[Figure: bar chart, "Numbers of gene sequences for metazoan phyla"; y-axis: number of sequences, 0 to 16,000,000; x-axis: metazoan phyla, from Cnidaria to Entoprocta]
Source: GenBank, all nucleotide sequences
Number of Possible Phylogenetic Trees
[Figure: no. of trees plotted against no. of tips (species), for unrooted and rooted trees]
N = 50 species:
No. rooted = 27529213532835651545259729751524430639300973035816196098326553772152587890625
No. unrooted = 283806325080779912837729172696128150920628587998105114415737667754150390625
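Counts of this size follow from the standard double-factorial formulas for strictly bifurcating, labelled trees: (2n-3)!! rooted trees and (2n-5)!! unrooted trees for n tips. A minimal sketch, assuming that is the counting convention behind the slide (the function names are illustrative):

```python
def odd_double_factorial(k):
    """Product 1 * 3 * 5 * ... * k for odd k >= 1."""
    result = 1
    for m in range(3, k + 1, 2):
        result *= m
    return result

def n_rooted_trees(n_tips):
    """Number of rooted, bifurcating, labelled trees: (2n - 3)!!, for n >= 2."""
    return odd_double_factorial(2 * n_tips - 3)

def n_unrooted_trees(n_tips):
    """Number of unrooted, bifurcating, labelled trees: (2n - 5)!!, for n >= 3."""
    return odd_double_factorial(2 * n_tips - 5)

for n in (4, 10, 50):
    print(n, n_unrooted_trees(n), n_rooted_trees(n))
```

Python integers are arbitrary precision, so the n = 50 values are computed exactly rather than as floating-point approximations.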
Sampling the Universe of Phylogenetic Trees
• Markov-Chain Monte Carlo (MCMC) methods
• Generate a large number of phylogenetic trees from a Markov chain
• At equilibrium, the chain randomly samples from the universe of trees
Sampling mechanism: the Metropolis-Hastings algorithm
• Accept the new tree with p = 1.0 if L(T_{n+1}) > L(T_n)
• Otherwise, accept with probability L(T_{n+1}) / L(T_n)
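The acceptance rule above can be illustrated on a toy problem. The sketch below is not a tree sampler: five states on a ring stand in for trees, their "likelihoods" are made-up values, and the proposal is symmetric, so only the likelihood ratio matters. Working with log-likelihoods avoids numerical underflow at realistic likelihood values.

```python
import math
import random

def metropolis_accept(log_l_new, log_l_old, rng):
    """Accept with p = 1 if the likelihood does not decrease, else with p = L_new / L_old."""
    if log_l_new >= log_l_old:
        return True
    return rng.random() < math.exp(log_l_new - log_l_old)

# Toy "universe" of five states on a ring, standing in for trees (hypothetical values).
LOG_L = [math.log(v) for v in (1.0, 2.0, 4.0, 2.0, 1.0)]

def run_chain(n_iter, seed=0):
    """Run the chain and return the fraction of iterations spent in each state."""
    rng = random.Random(seed)
    state = 0
    counts = [0] * len(LOG_L)
    for _ in range(n_iter):
        # Symmetric proposal: step one place left or right around the ring.
        proposal = (state + rng.choice((-1, 1))) % len(LOG_L)
        if metropolis_accept(LOG_L[proposal], LOG_L[state], rng):
            state = proposal
        counts[state] += 1
    return [c / n_iter for c in counts]

freqs = run_chain(50_000)
print([round(f, 3) for f in freqs])
```

At equilibrium the visit frequencies should approach the normalised likelihoods (0.1, 0.2, 0.4, 0.2, 0.1), which is the sense in which the chain samples states in proportion to their likelihood.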
Sampling the universe of possible trees: Markov-chain Monte Carlo methods
Long Interspersed Nuclear Elements (LINEs)
• autonomously replicating retrotransposons
• as old as mammals (at least)
• 20-40 active elements
• 500,000-1,000,000 ‘fossil’ fragments
• account for ~20% of the nucleotide content of the human genome
[Diagram: LINE element, ~6000 bases, drawn 5’ to 3’, with endonuclease and reverse transcriptase regions labelled]
Phylogenetic tree of LINEs in the human genome (n = 500), sampled from the Markov chain
Convergence of a Markov chain sampling phylogenetic trees of n = 500 tips, using an alignment of n = 4400 nucleotides
[Figure: log-likelihood plotted against iteration number]
NB: 99% of the increase in likelihood occurs in the first 2.8% of the run; 0.07% change in the final 2 million iterations.
Frequency histogram of log-likelihoods for phylogenetic trees of n = 500 LINEs in the human genome (alignment = 4000 bp). Note: unconverged chain.
n = 1000 trees; mean = -700299.7; std. dev. = 15.91
[Figure: histogram; x-axis: log-likelihood]
Metropolis-Coupled Markov Chain Monte Carlo (MCMCMC)
Given m simultaneous Markov chains, at each iteration propose swapping the states of a randomly chosen pair of chains i and j, and accept the swap according to:
{likelihood ratio chain i * likelihood ratio chain j}
[Diagram: chain states labelled x_i, x_j, x_k and y_i, y_j, y_k]
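The swap rule can be written out explicitly under one common formulation of Metropolis coupling, in which chain i samples from the likelihood raised to a power beta_i (its "heat", with beta = 1 for the cold chain). That formulation, the function names, and the numeric values below are assumptions for illustration; the slide itself gives only the product of the two likelihood ratios.

```python
import math
import random

def log_swap_ratio(log_l_i, log_l_j, beta_i, beta_j):
    """Log of (likelihood ratio for chain i) * (likelihood ratio for chain j).

    Chain i currently holds a state with log-likelihood log_l_i and would
    receive the state of chain j, and vice versa. Each ratio is raised to
    that chain's heat (assumed heating scheme: L ** beta).
    """
    ratio_chain_i = beta_i * (log_l_j - log_l_i)
    ratio_chain_j = beta_j * (log_l_i - log_l_j)
    return ratio_chain_i + ratio_chain_j

def accept_swap(log_l_i, log_l_j, beta_i, beta_j, rng):
    """Metropolis acceptance for the proposed swap."""
    log_r = log_swap_ratio(log_l_i, log_l_j, beta_i, beta_j)
    return log_r >= 0.0 or rng.random() < math.exp(log_r)

# Hypothetical values: the heated chain (beta = 0.5) has found a better tree.
rng = random.Random(0)
print(log_swap_ratio(-100.0, -90.0, 1.0, 0.5))        # positive: swap always accepted
print(accept_swap(-100.0, -90.0, 1.0, 0.5, rng))
```

If only the likelihood is heated, the priors of the two states cancel in the swap ratio; if the whole posterior is heated, the same function applies with log-posteriors in place of log-likelihoods.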
‘Temperatures’ of heated chains
[Figure: chain temperature plotted against chain number i, with the cold chain marked. Curves: 1/(1 + t(i-1)) for t = 0.2 and t = 0.5, and 1/i]
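The heating schedules named in the figure are easy to tabulate. A small sketch, taking the formula 1/(1 + t(i-1)) from the slide and assuming chains are numbered from i = 1 (the cold chain); note that 1/i is the special case t = 1.

```python
def chain_heat(i, t):
    """Heat of chain i (1-based) under the incremental scheme 1 / (1 + t * (i - 1))."""
    return 1.0 / (1.0 + t * (i - 1))

n_chains = 6  # illustrative number of chains
for t in (0.2, 0.5, 1.0):
    heats = [round(chain_heat(i, t), 3) for i in range(1, n_chains + 1)]
    print(f"t = {t}: {heats}")
```

Smaller t keeps neighbouring chains closer in temperature, which tends to raise swap acceptance at the cost of the hottest chain exploring less freely.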
Phylogeny of Human LINE-1 elements (92 elements, 4kb sequences)
[Figure: phylogenetic tree with tip labels such as C1.1, C6.1, C21.1 and C22.1, plus possum, mouse3, gorilla L1 and B-globi; scale bar 0.1; time scale marked at ~10-15, ~90 and ~120 millions of years ago]
LINEs data (truncated alignment): simultaneous chains with heating and swapping
[Figure: log-likelihood plotted against generation for the cold, warm and hot chains, with chain swapping marked]
LINEs data: log-likelihoods of trees from the cold chain (‘converged’ chain)
[Figure: histogram of count (0 to 45) against log-likelihood (-37820 to -37760), for pre-swap trees and post-swap trees]
Pattern-Heterogeneity Model of Gene-Sequence Evolution
• Allows different genes in a single concatenated alignment, or different regions of the same gene, to evolve in qualitatively different ways
• Contrast with rate heterogeneity, which can only detect differences in rates
• Pattern-heterogeneity is implemented without partitioning the data
• P-H will always equal or better the performance of the gamma rate-heterogeneity model; it normally yields substantial improvements (hundreds of log-units)
Applications
• Detecting regions of genes that evolve differently
• Large-scale inference: suitable for concatenated gene sequences (e.g. a recent phylogeny of the mammals was based upon 16,000 nucleotides and 16 genes), or “supermatrix” alignments
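One way to obtain pattern heterogeneity without partitioning is to treat each site's likelihood as a weighted sum over several rate matrices, so that no site has to be assigned to a gene or region in advance. The sketch below assumes that mixture form; the slide does not spell out the formula, and the per-site log-likelihoods are invented numbers standing in for values that a real program would compute on a tree.

```python
import math

def mixture_log_likelihood(site_logls, weights):
    """Total log-likelihood when each site is a weighted mixture over rate matrices.

    site_logls: one tuple per site, holding that site's log-likelihood under
                each rate matrix.
    weights:    mixture weights, one per rate matrix, summing to 1.
    """
    total = 0.0
    for per_matrix in site_logls:
        terms = [math.log(w) + ll for w, ll in zip(weights, per_matrix) if w > 0]
        biggest = max(terms)  # log-sum-exp for numerical stability
        total += biggest + math.log(sum(math.exp(x - biggest) for x in terms))
    return total

# Hypothetical per-site log-likelihoods under two rate matrices.
site_logls = [(-2.0, -5.0), (-6.0, -1.5), (-3.0, -3.0)]

only_first = mixture_log_likelihood(site_logls, (1.0, 0.0))
only_second = mixture_log_likelihood(site_logls, (0.0, 1.0))
mixed = mixture_log_likelihood(site_logls, (0.5, 0.5))
print(only_first, only_second, mixed)
```

Because a weight vector of (1, 0) recovers the single-matrix model exactly, a mixture with fitted weights cannot do worse than either matrix alone; in this toy case, where different sites favour different matrices, the equal-weight mixture already does better than both.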
Applications of pattern-heterogeneity model
[Diagram: three alignment layouts.
Concatenated gene alignment: rows species 1, species 2, ..., species n; columns gene 1, gene 2, gene 3, ..., gene k.
“Supermatrix” alignment: rows species 1, species 2, ..., species n-k, ..., species n.
Single gene alignment: rows species 1 to species n; columns divided into pattern 1, pattern 2 and pattern 3.]
[Figure: binary (0/1) character matrix with one row per seabird species (for example 'Oceanodroma_hornbyi', 'Gavia_stellata', 'Spheniscus_demersus', 'Diomedea_exulans', 'Pterodroma_cookii', 'Puffinus_puffinus'); rows contain differing numbers of characters. The individual rows are not recoverable from the extracted text.]
Testing the Pattern-Heterogeneity Model: two different rate matrices
Generate data on a known tree according to these two matrices and form a concatenated alignment: ‘gene 1’ = 600 bases, ‘gene 2’ = 400 bases.
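A simulation of this kind can be sketched in plain Python. Everything specific here is an assumption: the slide's actual rate matrices and tree are not given in the text, so the example uses the Kimura two-parameter (K80) model with two different transition/transversion ratios as the "two matrices", and a made-up three-taxon tree.

```python
import math
import random

BASES = "ACGT"
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}

def k80_probs(d, kappa):
    """K80 probabilities after branch length d: (same base, transition, each transversion)."""
    e1 = math.exp(-4.0 * d / (kappa + 2.0))
    e2 = math.exp(-2.0 * d * (kappa + 1.0) / (kappa + 2.0))
    return 0.25 + 0.25 * e1 + 0.5 * e2, 0.25 + 0.25 * e1 - 0.5 * e2, 0.25 - 0.25 * e1

def evolve(seq, d, kappa, rng):
    """Evolve a sequence along one branch of length d."""
    p_same, p_ts, _ = k80_probs(d, kappa)
    out = []
    for base in seq:
        u = rng.random()
        if u < p_same:
            out.append(base)
        elif u < p_same + p_ts:
            out.append(TRANSITION[base])
        else:
            out.append(rng.choice([b for b in BASES if b not in (base, TRANSITION[base])]))
    return "".join(out)

def simulate_gene(n_sites, kappa, rng):
    """Simulate one gene on the hypothetical tree ((A:0.1,B:0.1):0.05,C:0.15)."""
    root = "".join(rng.choice(BASES) for _ in range(n_sites))
    ab = evolve(root, 0.05, kappa, rng)
    return {"A": evolve(ab, 0.1, kappa, rng),
            "B": evolve(ab, 0.1, kappa, rng),
            "C": evolve(root, 0.15, kappa, rng)}

rng = random.Random(1)
gene1 = simulate_gene(600, 1.0, rng)    # kappa = 1: equal rates for all changes
gene2 = simulate_gene(400, 10.0, rng)   # kappa = 10: strong transition bias
alignment = {taxon: gene1[taxon] + gene2[taxon] for taxon in gene1}
print({taxon: len(seq) for taxon, seq in alignment.items()})
```

The concatenated alignment carries no marker of where gene 1 ends and gene 2 begins, which is the situation the pattern-heterogeneity model is meant to handle.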
log-likelihoods obtained from three models applied to simulated pattern-heterogeneity data
log-likelihoods by site in the simulated pattern-heterogeneity data
Pattern-heterogeneity model: Simulated and obtained values of the rate parameters
log-likelihoods for combined LSU/SSU nrRNA data set: 54 species n=800 sites
log-likelihoods by site in the LSU/SSU combined data set
[Figure annotation: the divide between the two genes]
Log-likelihoods for cytochrome-b data set. N=433 sites of which 300 are fixed for a single nucleotide
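Metropolis-Hastings Algorithm: accept the new tree with probability
$$\min\left(1,\; \underbrace{\frac{P(X \mid T_{n+1})}{P(X \mid T_n)}}_{\text{likelihood ratio}} \times \underbrace{\frac{P(T_{n+1})}{P(T_n)}}_{\text{prior ratio}} \times \underbrace{\frac{q(T_n \mid T_{n+1})}{q(T_{n+1} \mid T_n)}}_{\text{proposal ratio}}\right)$$
X = data (e.g., gene sequences); T = tree (topology, branches, parameters); q = proposal distribution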
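In practice the three ratios are combined on the log scale, where the product becomes a sum. A minimal sketch; the function and argument names are illustrative, and the inputs are assumed to be log-densities supplied by whatever likelihood, prior and proposal code is in use.

```python
import math
import random

def log_acceptance(log_lik_new, log_lik_old,
                   log_prior_new, log_prior_old,
                   log_q_reverse, log_q_forward):
    """Log Metropolis-Hastings acceptance probability.

    log_q_forward: log-density of proposing the new tree from the old one.
    log_q_reverse: log-density of proposing the old tree from the new one.
    """
    log_likelihood_ratio = log_lik_new - log_lik_old
    log_prior_ratio = log_prior_new - log_prior_old
    log_proposal_ratio = log_q_reverse - log_q_forward
    return min(0.0, log_likelihood_ratio + log_prior_ratio + log_proposal_ratio)

def accept(log_alpha, rng):
    """Draw the accept/reject decision for a given log acceptance probability."""
    return rng.random() < math.exp(log_alpha)

# Hypothetical values: a slightly worse tree, flat prior, symmetric proposal.
log_alpha = log_acceptance(-1002.0, -1000.0, 0.0, 0.0, 0.0, 0.0)
print(math.exp(log_alpha), accept(log_alpha, random.Random(0)))
```

With a flat prior and a symmetric proposal the last two terms vanish, and the rule reduces to accepting with probability equal to the likelihood ratio, capped at 1.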