CIPRES: Enabling Tree of Life Projects

CIPRES: Enabling Tree of Life Projects Tandy Warnow The University of Texas at Austin

Reconstructing the “Tree” of Life • Handling large datasets: millions of species • The “Tree of Life” is not really a tree: reticulate evolution

Cyber Infrastructure for Phylogenetic Research • Purpose: to create a national infrastructure of hardware, open source software, database technology, etc., necessary to infer the Tree of Life. • Group: 40 biologists, computer scientists, and mathematicians from 13 institutions. • Funding: $11.6 M (large ITR grant from NSF). • URL: http://www.phylo.org

CIPRes Members • University of New Mexico: Bernard Moret, David Bader • UCSD/SDSC: Fran Berman, Alex Borchers, Phil Bourne, John Huelsenbeck, Terri Liebowitz, Mark Miller • University of Connecticut: Paul O Lewis • University of Pennsylvania: Junhyong Kim, Susan Davidson, Sampath Kannan, Val Tannen • Texas A&M: Tiffani Williams • UT Austin: Tandy Warnow, David M. Hillis, Warren Hunt, Robert Jansen, Randy Linder, Lauren Meyers, Daniel Miranker • University of Arizona: David R. Maddison • University of British Columbia: Wayne Maddison • North Carolina State University: Spencer Muse • American Museum of Natural History: Ward C. Wheeler • NJIT: Usman Roshan • UC Berkeley: Satish Rao, Steve Evans, Richard M Karp, Brent Mishler, Elchanan Mossel, Eugene W. Myers, Christos M. Papadimitriou, Stuart J. Russell • Rice: Luay Nakhleh • SUNY Buffalo: William Piel • Florida State University: David L. Swofford, Mark Holder • Yale: Michael Donoghue, Paul Turner

CIPRES activity • Databases - e.g. TreeBase II (Bill Piel and others) • Simulations of large-scale complex genome-scale evolution (Junhyong Kim) • Outreach (Michael Donoghue and Brent Mishler) • Algorithms (Tandy Warnow) • Open source software (Wayne Maddison, Dave Swofford, Mark Holder, and Bernard Moret) • Computer cluster at SDSC (Fran Berman and Mark Miller) - available to ATOL projects and other groups with datasets above 1000 taxa

CIPRES research in algorithms • Multiple sequence alignment • Genomic alignment • Heuristics for Maximum Parsimony and Maximum Likelihood • Bayesian MCMC methods • Supertree methods • Whole genome phylogeny reconstruction • Reticulate evolution detection and reconstruction • Data mining on sets of trees, and compact representations of these sets

Software distributions • The first distribution (in the next months) will focus on Rec-I-DCM3(PAUP*): fast heuristic searches for maximum parsimony on large datasets, for PAUP* users • All software will be open source • Community contributions to software will be enabled

Phylogenetic reconstruction methods • Heuristics for hard optimization criteria (Maximum Parsimony and Maximum Likelihood) - hard to solve on large datasets • Polynomial time distance-based methods: Neighbor Joining, FastME, Weighbor, etc. - poor accuracy on datasets with large evolutionary distances [Figure: cost across the space of phylogenetic trees, showing a local optimum and the global optimum]

DCMs: Divide-and-conquer for improving phylogeny reconstruction

“Boosting” phylogeny reconstruction methods • DCMs “boost” the performance of phylogeny reconstruction methods. [Diagram: DCM applied to base method M yields the boosted method DCM-M]

DCMs (Disk-Covering Methods) • DCMs for polynomial time methods improve topological accuracy (empirical observation), and have provable theoretical guarantees under Markov models of evolution • DCMs for hard optimization problems reduce the running time needed to achieve good levels of accuracy (empirical observation)

DCM1-boosting distance-based methods [Nakhleh et al., ISMB 2001] • DCM1-boosting makes distance-based methods more accurate • Theoretical guarantees that DCM1-NJ converges to the true tree from polynomial length sequences [Figure: error rate (0 to 0.8) versus number of taxa (0 to 1600) for NJ and DCM1-NJ]

Major challenge: MP and ML • Maximum Parsimony (MP) and Maximum Likelihood (ML) remain the methods of choice for most systematists • The main challenge here is to make it possible to obtain good solutions to MP or ML in reasonable time periods on large datasets
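To make the MP objective concrete, here is a minimal sketch of scoring one fixed tree with Fitch's algorithm, a standard method for the "small parsimony" problem. The hard part described on the slide is the search over trees, which this sketch does not attempt. The nested-tuple tree encoding, the function names, and the toy sequences are assumptions made for illustration; they are not taken from PAUP*, TNT, or any CIPRES software.

```python
from typing import Dict, Set, Tuple, Union

# A rooted binary tree is either a leaf name or a pair of subtrees (hypothetical encoding).
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch_site(tree: Tree, states: Dict[str, str]) -> Tuple[Set[str], int]:
    """Return (candidate state set, number of changes) for one site."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change is needed here.
        return common, left_cost + right_cost
    # The children disagree: take the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree: Tree, seqs: Dict[str, str]) -> int:
    """Sum the Fitch cost over all sites of equal-length aligned sequences."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in seqs.items()})[1]
        for i in range(length)
    )


# Toy alignment (made up for this example).
toy = {"A": "AAG", "B": "AAG", "C": "ACG", "D": "ACT"}
print(parsimony_score((("A", "B"), ("C", "D")), toy))  # 2
print(parsimony_score((("A", "C"), ("B", "D")), toy))  # 3
```

An MP heuristic would call a scorer like this on many candidate trees and keep the lowest-scoring one, which is why the number of candidate trees matters so much.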

Solving NP-hard problems exactly is … unlikely • Number of (unrooted) binary trees on n leaves is (2n-5)!! • If each tree on 1000 taxa could be analyzed in 0.001 seconds, we would find the best tree in 2890 millennia
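The tree count on this slide can be checked directly. The sketch below computes (2n-5)!!, the product of the odd numbers from 1 to 2n-5; the function name is an assumption made for illustration.

```python
import math


def num_unrooted_binary_trees(n: int) -> int:
    """Number of unrooted binary trees on n labeled leaves: (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count


for n in (4, 5, 10):
    print(n, num_unrooted_binary_trees(n))

# For 1000 taxa the count is far too large to enumerate; report its order of magnitude.
print(math.floor(math.log10(num_unrooted_binary_trees(1000))))
```

Python integers are arbitrary precision, so the exact value for 1000 taxa is computed without overflow; only its base-10 logarithm is printed.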

How good an MP analysis do we need? • Our research shows that we need to get within 0.01% of optimal (or even closer on large datasets) to return reasonable estimates of the true tree’s “topology”

Problems with current techniques for MP • Shown here is the performance of a heuristic maximum parsimony analysis on a real dataset of almost 14,000 sequences. (“Optimal” here means best score to date, using any method for any amount of time.) • Acceptable error is below 0.01%. [Figure: Performance of TNT with time]
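The 0.01% criterion is a relative gap between a tree's parsimony score and the best score known. A minimal sketch of that check follows; the function names and the example scores are hypothetical, and exact fractions are used so that a gap of exactly 0.01% is not misjudged through floating-point rounding.

```python
from fractions import Fraction


def percent_above_best(score: int, best: int) -> Fraction:
    """Percentage by which a parsimony score exceeds the best known score."""
    if best <= 0 or score < best:
        raise ValueError("expected 0 < best <= score")
    return Fraction(score - best, best) * 100


def is_acceptable(score: int, best: int,
                  threshold_percent: Fraction = Fraction(1, 100)) -> bool:
    """True if the score is within the threshold (default 0.01%) of the best score."""
    return percent_above_best(score, best) <= threshold_percent


# Hypothetical scores: with a best score of 100000, only 10 extra steps are tolerated.
print(is_acceptable(100010, 100000))  # True
print(is_acceptable(100011, 100000))  # False
```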

Observations • The best MP heuristics cannot get acceptably good solutions within 24 hours on most of these large datasets. • Datasets of these sizes may need months (or years) of further analysis to reach reasonable solutions. • Apparent convergence can be misleading.

Our objective: speed up the best MP heuristics [Figure (illustrative, not a real study): MP score of best trees versus time, comparing the performance of a hill-climbing heuristic with the desired performance]

DCM3 decomposition • Input: Set S of sequences, and guide-tree T • 1. Compute short subtree graph G(S,T), based upon T • 2. Find clique separator in the graph G(S,T) and form subproblems • DCM3 decompositions (1) can be obtained in O(n) time, (2) yield small subproblems, (3) can be used iteratively, (4) can be applied recursively

Iterative-DCM3 [Diagram: tree T → DCM3 with base method → tree T’]

New DCMs • DCM3: (1) Compute subproblems using DCM3 decomposition; (2) Apply base method to each subproblem to yield subtrees; (3) Merge subtrees using the Strict Consensus Merger technique; (4) Randomly refine to make it binary • Recursive-DCM3 • Iterative DCM3: (1) Compute a DCM3 tree; (2) Perform local search and go to step 1 • Recursive-Iterative DCM3
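The control flow of DCM3 and Iterative DCM3 can be sketched as a skeleton in which every phylogenetic step is passed in as a callable. This is only an outline of the loop described on the slide: the decomposition, base method, Strict Consensus Merger, refinement, and local search are not implemented, all function and parameter names are hypothetical, and keeping track of the best tree seen is a choice made for this sketch. The usage example plugs in toy stand-ins that operate on integers, purely to show that the loop runs.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")  # a tree, in whatever representation the caller uses
S = TypeVar("S")  # a subproblem (subset of the taxa)


def dcm3_tree(guide: T,
              decompose: Callable[[T], List[S]],
              base_method: Callable[[S], T],
              merge: Callable[[List[T]], T],
              refine: Callable[[T], T]) -> T:
    """One DCM3 pass: decompose, solve subproblems, merge, refine to binary."""
    subtrees = [base_method(sub) for sub in decompose(guide)]
    return refine(merge(subtrees))


def iterative_dcm3(start: T, rounds: int,
                   decompose: Callable[[T], List[S]],
                   base_method: Callable[[S], T],
                   merge: Callable[[List[T]], T],
                   refine: Callable[[T], T],
                   local_search: Callable[[T], T],
                   score: Callable[[T], float]) -> T:
    """Repeat: compute a DCM3 tree, run local search, reuse the result as guide tree."""
    current = best = start
    for _ in range(rounds):
        current = local_search(dcm3_tree(current, decompose, base_method, merge, refine))
        if score(current) < score(best):  # lower parsimony score is better
            best = current
    return best


# Toy stand-ins (NOT phylogenetics): a "tree" is just its integer score.
result = iterative_dcm3(
    start=10, rounds=3,
    decompose=lambda t: [t, t],
    base_method=lambda sub: sub - 1,
    merge=min,
    refine=lambda t: t,
    local_search=lambda t: max(t - 1, 0),
    score=lambda t: t,
)
print(result)  # 4
```

In Recursive-DCM3, `base_method` would itself apply a DCM3 pass to any subproblem that is still too large, which fits the same skeleton.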

Rec-I-DCM3 significantly improves performance [Figure: comparison of TNT (current best techniques) to Rec-I-DCM3(TNT) (DCM boosted version of best techniques) on one large dataset]

Datasets Obtained from various researchers and online databases • 1322 lsu rRNA of all organisms • 2000 Eukaryotic rRNA • 2594 rbcL DNA • 4583 Actinobacteria 16s rRNA • 6590 ssu rRNA of all Eukaryotes • 7180 three-domain rRNA • 7322 Firmicutes bacteria 16s rRNA • 8506 three-domain+2org rRNA • 11361 ssu rRNA of all Bacteria • 13921 Proteobacteria 16s rRNA

Rec-I-DCM3(TNT) vs. TNT (comparison of scores at 24 hours) • Base method is the default TNT technique, the current best method for MP. • Rec-I-DCM3 significantly improves upon the unboosted TNT by returning trees that are at most 0.01% above optimal on most datasets.

Observations • Rec-I-DCM3 improves upon the best performing heuristics for MP. • The improvement increases with the difficulty of the dataset.

DCMs • DCMs for NJ and other distance methods produce absolute fast-converging (afc) methods • DCMs for MP heuristics • DCMs for use with the GRAPPA software for whole genome phylogenetic analysis; these have been shown to let GRAPPA scale from its maximum of about 15-20 genomes to 1000 genomes. • Current projects: DCM development for maximum likelihood and multiple sequence alignment.

Part II: Whole-Genome Phylogenetics [Figure: example trees on taxa A to F with internal nodes W, X, Y, Z]

Genomes Evolve by Rearrangements • Starting genome: 1 2 3 4 5 6 7 8 9 10 • Inversion (Reversal): 1 2 3 -8 -7 -6 -5 -4 9 10 • Transposition: 1 2 3 9 4 5 6 7 8 10 • Inverted Transposition: 1 2 3 9 -8 -7 -6 -5 -4 10
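The three rearrangement events can be written as operations on a signed gene order, where a negative number marks a gene on the opposite strand. The sketch below treats the genome as a Python list and reproduces the ten-gene example from the slide; the function names and the half-open index convention are assumptions made for illustration.

```python
from typing import List

Genome = List[int]  # signed gene order; a negative sign means reversed orientation


def inversion(g: Genome, i: int, j: int) -> Genome:
    """Reverse the segment g[i:j] and flip the sign of every gene in it."""
    return g[:i] + [-x for x in reversed(g[i:j])] + g[j:]


def transposition(g: Genome, i: int, j: int, k: int) -> Genome:
    """Move the segment g[i:j] to just before position k (requires j <= k)."""
    return g[:i] + g[j:k] + g[i:j] + g[k:]


def inverted_transposition(g: Genome, i: int, j: int, k: int) -> Genome:
    """Move the segment g[i:j] to just before position k, inverting it on the way."""
    return g[:i] + g[j:k] + [-x for x in reversed(g[i:j])] + g[k:]


start = list(range(1, 11))  # 1 2 3 4 5 6 7 8 9 10
print(inversion(start, 3, 8))                  # [1, 2, 3, -8, -7, -6, -5, -4, 9, 10]
print(transposition(start, 3, 8, 9))           # [1, 2, 3, 9, 4, 5, 6, 7, 8, 10]
print(inverted_transposition(start, 3, 8, 9))  # [1, 2, 3, 9, -8, -7, -6, -5, -4, 10]
```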

Genome Rearrangement Has A Huge State Space • DNA sequences: 4 states per site • Signed circular genomes with n genes: $2^{n-1}(n-1)!$ states, 1 site • Circular genomes (1 site) with 37 genes: $2.56 \times 10^{52}$ states; with 120 genes: $3.70 \times 10^{232}$ states
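The size of this state space is easy to compute. The sketch below assumes the standard count for signed circular genomes on n genes, 2^(n-1) (n-1)!, which treats rotations and the reading direction of the circle as equivalent; the function name is hypothetical.

```python
from math import factorial


def signed_circular_states(n: int) -> int:
    """Number of distinct signed circular gene orders on n genes: 2^(n-1) * (n-1)!."""
    if n < 1:
        raise ValueError("need at least one gene")
    return 2 ** (n - 1) * factorial(n - 1)


for n in (37, 120):
    print(n, f"{signed_circular_states(n):.2e}")
# 37 2.56e+52
# 120 3.70e+232
```

Compare this with a DNA site, which has only 4 possible states: a whole gene order behaves like a single character with an enormous alphabet.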

Why use gene orders? • “Rare genomic changes”: huge state space and relative infrequency of events (compared to site substitutions) could make the inference of deep evolution easier, or more accurate. • Our research shows this is true, but accurate analysis of gene order data is computationally very intensive!

Maximum Parsimony on Rearranged Genomes (MPRG) • The leaves are rearranged genomes. • Find the tree that minimizes the total number of rearrangement events (NP-hard) [Figure: example tree on genomes A to F with edge lengths; total length = 18]
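To illustrate what "total length" means for a tree of genomes, the sketch below scores a tree whose nodes are all labeled with gene orders by summing a pairwise distance over its edges. It uses the breakpoint distance on signed circular genomes as a simple stand-in for the number of rearrangement events; the inversion distance named on the slide is a different and harder quantity, and finding the internal genomes is the difficult part that this sketch skips. All names and the toy tree are assumptions made for illustration and are not GRAPPA's API.

```python
from typing import Dict, List, Set, Tuple

Genome = List[int]  # signed circular gene order


def adjacencies(genome: Genome) -> Set[Tuple[int, int]]:
    """All adjacent gene pairs, read in both directions around the circle."""
    adj: Set[Tuple[int, int]] = set()
    n = len(genome)
    for idx in range(n):
        a, b = genome[idx], genome[(idx + 1) % n]
        adj.add((a, b))
        adj.add((-b, -a))  # the same adjacency seen from the other strand
    return adj


def breakpoint_distance(g1: Genome, g2: Genome) -> int:
    """Number of adjacencies of g1 that are not preserved in g2."""
    adj2 = adjacencies(g2)
    n = len(g1)
    return sum(1 for idx in range(n) if (g1[idx], g1[(idx + 1) % n]) not in adj2)


def tree_length(edges: List[Tuple[str, str]], genomes: Dict[str, Genome]) -> int:
    """Sum of edge distances for a tree with a genome at every node."""
    return sum(breakpoint_distance(genomes[u], genomes[v]) for u, v in edges)


ancestor = list(range(1, 11))
toy_genomes = {
    "X": ancestor,                                # hypothetical internal node
    "A": ancestor,
    "B": [1, 2, 3, -8, -7, -6, -5, -4, 9, 10],    # one inversion away
    "C": [1, 2, 3, 9, 4, 5, 6, 7, 8, 10],         # one transposition away
}
print(tree_length([("X", "A"), ("X", "B"), ("X", "C")], toy_genomes))  # 5
```

An inversion creates two breakpoints and a transposition three, which is why the toy tree has length 0 + 2 + 3 = 5 under this stand-in distance.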

“Solving” the inversion phylogeny • Usual issue of getting stuck in local optima, since the optimization problems are NP-hard • Additional problem: finding the best trees is enormously hard, since even the “point estimation” problem is hard (worse than estimating branch lengths in ML). [Figure: MP score across the space of phylogenetic trees, showing a local optimum and the global optimum]

Benchmark gene order dataset: Campanulaceae • 12 genomes + 1 outgroup (Tobacco), 105 gene segments • NP-hard optimization problems: breakpoint and inversion phylogenies (techniques score every tree) • 1997: BPAnalysis (Blanchette and Sankoff): 200 years (est.) • 2000: Using GRAPPA v1.1 on the 512-processor Los Lobos Supercluster machine: 2 minutes (200,000-fold speedup per processor) • 2003: Using latest version of GRAPPA: 2 minutes on a single processor (1-billion-fold speedup per processor) • Joint work with Bob Jansen, Linda Raubeson, Jijun Tang, and Li-San Wang

GRAPPA (Genome Rearrangement Analysis under Parsimony and other Phylogenetic Algorithms) http://www.cs.unm.edu/~moret/GRAPPA/ • Heuristics for NP-hard optimization problems • Fast polynomial time distance-based methods • Contributors: U. New Mexico, U. Texas at Austin, Università di Bologna, Italy • Freely available in source code at this site. • Project leader: Bernard Moret (UNM) (moret@cs.unm.edu)

Limitations and ongoing research • Current methods are mostly limited to single chromosomes with equal gene content (or very small amounts of deletions and duplications). • We have made some progress on developing a reliable distance-based method for chromosomes with unequal gene content (tests on real and simulated data show high accuracy) • Handling the multiple chromosome case is harder

Acknowledgements • NSF • The David and Lucile Packard Foundation • The Program in Evolutionary Dynamics at Harvard • The Institute for Cellular and Molecular Biology at UT-Austin • See http://www.phylo.org and http://www.cs.utexas.edu/~tandy for more info
