1 / 31

Using Divide-and-Conquer to Construct the Tree of Life

Using Divide-and-Conquer to Construct the Tree of Life. Tandy Warnow University of Illinois at Urbana-Champaign. Phylogeny (evolutionary tree). From the Tree of the Life Website, University of Arizona. Two Two dimensions: number of genes and number of species. Phylogenomic pipeline.

Télécharger la présentation

Using Divide-and-Conquer to Construct the Tree of Life

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Using Divide-and-Conquer to Construct the Tree of Life Tandy Warnow University of Illinois at Urbana-Champaign

  2. Phylogeny (evolutionary tree) From the Tree of the Life Website,University of Arizona

  3. Two Two dimensions: number of genes and number of species

  4. Phylogenomic pipeline • Select taxon set and markers • Gather and screen sequence data, possibly identify orthologs • Compute multiple sequence alignments for each locus, and construct gene trees • Compute species tree or network: • Combine the estimated gene trees, OR • Estimate a tree from a concatenation of the multiple sequence alignments • Get statistical support on each branch (e.g., bootstrapping) • Estimate dates on the nodes of the phylogeny • Use species tree with branch support and dates to understand biology

  5. Phylogenomic pipeline • Select taxon set and markers • Gather and screen sequence data, possibly identify orthologs • Compute multiple sequence alignments for each locus, and construct gene trees • Compute species tree or network: • Combine the estimated gene trees, OR • Estimate a tree from a concatenation of the multiple sequence alignments • Get statistical support on each branch (e.g., bootstrapping) • Estimate dates on the nodes of the phylogeny • Use species tree with branch support and dates to understand biology

  6. 1KP: Thousand Transcriptome Project T. Warnow, S. Mirarab, N. Nguyen UT-Austin UT-Austin UT-Austin N. Matasci iPlant J. Leebens-Mack U Georgia N. Wickett Northwestern G. Ka-Shu Wong U Alberta • First publication: Wickett, Mirarab, et al., PNAS, 2014 • Used SATé (Liu et al., Science 2009 and Syst Biol 2012) to compute multiple sequence alignments and trees • Used ASTRAL (Mirarab et al., Bioinf 2014 and 2015) to compute the species tree • Upcoming Challenge: • Multiple sequence alignment and gene tree estimation on 100,000 sequences. • Many sequences are highly fragmentary.

  7. Multiple Sequence Alignment (MSA): a scientific grand challenge1 S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- … Sn = -------TCAC--GACCGACA S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC … Sn = TCACGACCGACA Novel techniques needed for scalability and accuracy NP-hard problems and large datasets Current methods do not provide good accuracy Few methods can analyze even moderately large datasets Many important applications besides phylogenetic estimation 1 Frontiers in Massive Data Analysis, National Academies Press, 2013

  8. Divide-and-Conquer • Divide-and-conquer is a basic algorithmic trick for solving problems! • Three steps: • divide a dataset into two or more sets, • solve the problem on each set, and • combine solutions.

  9. Computational Phylogenetics (2005) Current methods can use months to estimate trees on 1000 DNA sequences Our objective: More accurate trees and alignments on 500,000 sequences in under a week Courtesy of the Tree of Life web project, tolweb.org

  10. Computational Phylogenetics (2015) 2012: Computing accurate trees (almost) without multiple sequence alignments 2009-2015: Co-estimation of multiple sequence alignments and gene trees, now on 1,000,000 sequences in under two weeks 2014-2015: Species tree estimation from whole genomes in the presence of massive gene tree heterogeneity Courtesy of the Tree of Life web project, tolweb.org

  11. Deletion Substitution …ACGGTGCAGTTACCA… …ACGGTGCAGTTACC-A… …AC----CAGTCACCTA… • The true multiple alignment • Reflects historical substitution, insertion, and deletion events • Defined using transitive closure of pairwise alignments computed on edges of the true tree Insertion …ACCAGTCACCTA…

  12. Phylogenetic Tree Estimation S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

  13. Input: unaligned sequences S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

  14. Phase 1: Alignment S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA

  15. Phase 2: Construct tree S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA S1 S2 S4 S3

  16. Quantifying Error FN FN: false negative (missing edge) FP: false positive (incorrect edge) FP 50% error rate

  17. Datasets: 1000-taxon simulated datasets under varying rates of evolution Biological datasets with structural alignments Liu et al., Science 2009 Evaluation of MSA methods (Science 2009) Alignment methods • Clustal • MAFFT • Muscle • Prank (PNAS 2005, Science 2008) • Opal (ISMB and Bioinf. 2007) Phylogeny estimation: • Maximum likelihood using RAxML

  18. 1000-taxon models, ordered by difficulty (Liu et al., Science 19 June 2009)

  19. Observations • Large datasets can be easy to align with high accuracy if there is not too much heterogeneity. • Poor alignments produce poor trees.

  20. Observations • Highly accurate alignments are easy if the dataset is not too heterogeneous. • We can use phylogenies to decompose datasets into smaller, less heterogeneous datasets.

  21. A C B D Re-aligning on a tree Decompose dataset A B C D Align subproblems A B C D Estimate ML tree on merged alignment ABCD Merge sub-alignments

  22. SATé and PASTA Input: set of unaligned sequences Output: multiple sequence alignment and tree • SATé: Liu et al., Science 2009 (up to 10,000 sequences) and Systematic Biology 2012 (up to 50,000 sequences) • PASTA: Mirarab et al., J. Comp Biol 2015 (up to 1,000,000 sequences)

  23. Obtain initial alignment and estimated ML tree Tree SATé and PASTA Algorithms Use tree to compute new alignment

  24. Obtain initial alignment and estimated ML tree Tree Use tree to compute new alignment Alignment SATé and PASTA Algorithms

  25. Obtain initial alignment and estimated ML tree Tree Use tree to compute new alignment Estimate ML tree on new alignment Alignment SATé and PASTA Algorithms

  26. Obtain initial alignment and estimated ML tree Tree Use tree to compute new alignment Estimate ML tree on new alignment Alignment SATé and PASTA Algorithms Repeat until termination condition, and return the alignment/tree pair with the best ML score

  27. 1000-taxon models, ordered by difficulty (Liu et al., Science 19 June 2009) SATé: 24-hour co-estimation of highly accurate alignments and trees on 1000 sequences 24-hour SATé analysis, on desktop machines (Similar improvements for biological datasets)

  28. (Liu et al., SystBiol 61(1):90-106, 2012) SATé-2: even more accurate!

  29. PASTA: even more accurate, and can scale to 1,000,000 sequences • Simulated RNASim datasets from 10K to 200K taxa • Limited to 24 hours using 12 CPUs • Not all methods could run (missing bars could not finish) • PASTA, Mirarab et al., J Comp Biol 22(5): 377-386 (2015)

  30. Main Points • Innovative algorithm design can improve accuracy as well as reduce running time. • Divide-and-conquer is a key algorithmic technique that has dramatically changed the toolkit for biologists!

  31. Acknowledgments Funding: HHMI (to Siavash Mirarab) Guggenheim Foundation Packard Foundation NSF Microsoft Research New England David Bruton Jr. Centennial Professorship Grainger Foundation TACC (Texas Advanced Computing Center)

More Related