1 / 45

Phylogenetic Analysis based on two talks, by

Phylogenetic Analysis based on two talks, by. Caro-Beth Stewart, Ph.D . Department of Biological Sciences University at Albany, SUNY c.stewart@albany.edu and Tal Pupko, Ph.D. Faculty of Life Science Tel-Aviv University talp@post.tau.ac.il.

maxime
Télécharger la présentation

Phylogenetic Analysis based on two talks, by

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Phylogenetic Analysisbased on two talks, by Caro-Beth Stewart, Ph.D. Department of Biological Sciences University at Albany, SUNY c.stewart@albany.edu and Tal Pupko, Ph.D. Faculty of Life Science Tel-Aviv University talp@post.tau.ac.il Based on lectures by C-B Stewart, and by Tal Pupko

  2. What is phylogenetic analysis and why should we perform it? Phylogenetic analysis has two major components: 1.Phylogeny inference or “tree building” — the inference of the branching orders, and ultimately the evolutionary relationships, between “taxa” (entities such as genes, populations, species, etc.) 2. Character and rate analysis — using phylogenies as analytical frameworks for rigorous understanding of the evolution of various traits or conditions of interest Based on lectures by C-B Stewart, and by Tal Pupko

  3. Common Phylogenetic Tree Terminology Terminal Nodes Branches or Lineages A Represent the TAXA (genes, populations, species, etc.) used to infer the phylogeny B C D Ancestral Node or ROOT of the Tree E Internal Nodes or Divergence Points (represent hypothetical ancestors of the taxa) Based on lectures by C-B Stewart, and by Tal Pupko

  4. Taxon B Taxon C No meaning to the spacing between the taxa, or to the order in which they appear from top to bottom. Taxon A Taxon D Taxon E This dimension either can have no scale (for ‘cladograms’), can be proportional to genetic distance or amount of change (for ‘phylograms’ or ‘additive trees’), or can be proportional to time (for ‘ultrametric trees’ or true evolutionary trees). Phylogenetic trees diagram the evolutionary relationships between the taxa ((A,(B,C)),(D,E)) = The above phylogeny as nested parentheses These say that B and C are more closely related to each other than either is to A, and that A, B, and C form a clade that is a sister group to the clade composed of D and E. If the tree has a time scale, then D and E are the most closely related. Based on lectures by C-B Stewart, and by Tal Pupko

  5. A few examples of what can be inferred from phylogenetic trees built from DNAor protein sequence data: • Which species are the closest living relatives of modern humans? • Did the infamous Florida Dentist infect his patients with HIV? • What were the origins of specific transposable elements? • Plus countless others….. Based on lectures by C-B Stewart, and by Tal Pupko

  6. Which species are the closest living relatives of modern humans? Mitochondrial DNA, most nuclear DNA-encoded genes, and DNA/DNA hybridization all show that bonobos and chimpanzees are related more closely to humans than either are to gorillas. Humans Gorillas Chimpanzees Chimpanzees Bonobos Bonobos Gorillas Orangutans Orangutans Humans 14 0 0 15-30 MYA MYA The pre-molecular view was that the great apes (chimpanzees, gorillas and orangutans) formed a clade separate from humans, and that humans diverged from the apes at least 15-30 MYA. Based on lectures by C-B Stewart, and by Tal Pupko

  7. No No Did the Florida Dentist infect his patients with HIV? DENTIST Phylogenetic tree of HIV sequences from the DENTIST, his Patients, & Local HIV-infected People: Patient C Patient A Patient G Yes: The HIV sequences from these patients fall within the clade of HIV sequences found in the dentist. Patient B Patient E Patient A DENTIST Local control 2 Local control 3 Patient F Local control 9 Local control 35 Local control 3 Patient D Based on lectures by C-B Stewart, and by Tal Pupko From Ou et al. (1992) and Page & Holmes (1998)

  8. A few examples of what can be learned from character analysis using phylogenies as analytical frameworks: • When did specific episodes of positive Darwinian selection occur during evolutionary history? • Which genetic changes are unique to the human lineage? • What was the most likely geographical location of the common ancestor of the African apes and humans? • Plus countless others….. Based on lectures by C-B Stewart, and by Tal Pupko

  9. A B A C C D B C D A E B C A D E B F The number of unrooted trees increases in a greater than exponential manner with number of taxa (2N - 5)!! = # unrooted trees for N taxa Based on lectures by C-B Stewart, and by Tal Pupko

  10. B C Root D A A C B D Rooted tree Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D. Root Inferring evolutionary relationships between the taxa requires rooting the tree: To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root: Unrooted tree Based on lectures by C-B Stewart, and by Tal Pupko

  11. Now, try it again with the root at another position: B C Root Unrooted tree D A A B C D Rooted tree Note that in this rooted tree, taxon A is most closely related to taxon B, and together they are equally distantly related to taxa C and D. Root Based on lectures by C-B Stewart, and by Tal Pupko

  12. 2 4 1 5 3 Rooted tree 1a Rooted tree 1b Rooted tree 1c Rooted tree 1d Rooted tree 1e B A A C D A B D C B C C C A A D B B D D An unrooted, four-taxon tree theoretically can be rooted in five different places to produce five different rooted trees A C The unrooted tree 1: D B These trees showfive different evolutionary relationships among the taxa! Based on lectures by C-B Stewart, and by Tal Pupko

  13. There are two major ways to root trees: By outgroup: Uses taxa (the “outgroup”) that are known to fall outside of the group of interest (the “ingroup”). Requires some prior knowledge about the relationships among the taxa. The outgroup can either be species (e.g., birds to root a mammalian tree) or previous gene duplicates (e.g., a-globins to root b-globins). outgroup By midpoint or distance: Roots the tree at the midway point between the two most distant taxa in the tree, as determined by branch lengths. Assumes that the taxa are evolving in a clock-like manner. This assumption is built into some of the distance-based tree building methods. A d (A,D) = 10 + 3 + 5 = 18 Midpoint = 18 / 2 = 9 10 C 3 2 2 B D 5 Based on lectures by C-B Stewart, and by Tal Pupko

  14. C A x = D B C A D B E C A D E B F (2N - 3)!! = # unrooted trees for N taxa Each unrooted tree theoretically can be rooted anywhere along any of its branches Based on lectures by C-B Stewart, and by Tal Pupko

  15. COMPUTATIONAL METHOD Optimality criterion Clustering algorithm PARSIMONY MAXIMUM LIKELIHOOD Characters DATA TYPE MINIMUM EVOLUTION LEAST SQUARES UPGMA NEIGHBOR-JOINING Distances Molecular phylogenetic tree building methods: Are mathematical and/or statistical methods for inferring the divergence order of taxa, as well as the lengths of the branches that connect them. There are many phylogenetic methods available today, each having strengths and weaknesses. Most can be classified as follows: Based on lectures by C-B Stewart, and by Tal Pupko

  16. Types of data used in phylogenetic inference: Character-based methods:Use the aligned characters, such as DNA or protein sequences, directly during tree inference. TaxaCharacters Species A ATGGCTATTCTTATAGTACG Species B ATCGCTAGTCTTATATTACA Species C TTCACTAGACCTGTGGTCCA Species D TTGACCAGACCTGTGGTCCG Species E TTGACCAGTTCTCTAGTTCG Distance-based methods:Transform the sequence data into pairwise distances (dissimilarities), and then use the matrix during tree building. A B C D E Species A ---- 0.20 0.50 0.45 0.40 Species B 0.23 ---- 0.40 0.55 0.50 Species C 0.87 0.59 ---- 0.15 0.40 Species D 0.73 1.12 0.17 ---- 0.25 Species E 0.59 0.89 0.61 0.31 ---- Example 1: Uncorrected “p” distance (=observed percent sequence difference) Based on lectures by C-B Stewart, and by Tal Pupko Example 2: Kimura 2-parameter distance (estimate of the true number of substitutions between taxa)

  17. Computational methods for finding optimal trees: Exact algorithms:"Guarantee" to find the optimal or "best" tree for the method of choice. Two types used in tree building: Exhaustive search: Evaluates all possible unrooted trees, choosing the one with the best score for the method. Branch-and-bound search: Eliminates the parts of the search tree that only contain suboptimal solutions. Heuristic algorithms: Approximate or “quick-and-dirty” methods that attempt to find the optimal tree for the method of choice, but cannot guarantee to do so. Heuristic searches often operate by “hill-climbing” methods. Based on lectures by C-B Stewart, and by Tal Pupko

  18. A B A C C D B C D A E B C A D E B F Exact searches become increasingly difficult, and eventually impossible, as the number of taxa increases: (2N - 5)!! = # unrooted trees for N taxa Based on lectures by C-B Stewart, and by Tal Pupko

  19. Heuristic search algorithms are input order dependent and can get stuck in local minima or maxima Rerunning heuristic searches using different input orders of taxa can help find global minima or maxima Search for global maximum Search for global minimum GLOBAL MAXIMUM GLOBAL MAXIMUM local maximum local minimum GLOBAL MINIMUM GLOBAL MINIMUM Based on lectures by C-B Stewart, and by Tal Pupko

  20. Classification of phylogenetic inference methods COMPUTATIONAL METHOD Optimality criterion Clustering algorithm PARSIMONY MAXIMUM LIKELIHOOD Characters DATA TYPE MINIMUM EVOLUTION LEAST SQUARES UPGMA NEIGHBOR-JOINING Distances Based on lectures by C-B Stewart, and by Tal Pupko

  21. Parsimony methods: • Optimality criterion: The ‘most-parsimonious’ tree is the one that • requires the fewest number of evolutionary events (e.g., nucleotide • substitutions, amino acid replacements) to explain the sequences. • Advantages: • Are simple, intuitive, and logical (many possible by ‘pencil-and-paper’). • Can be used on molecular and non-molecular (e.g., morphological) data. • Can tease apart types of similarity (shared-derived, shared-ancestral, homoplasy) • Can be used for character (can infer the exact substitutions) and rate analysis. • Can be used to infer the sequences of the extinct (hypothetical) ancestors. • Disadvantages: • Are simple, intuitive, and logical (derived from “Medieval logic”, not statistics!) • Can be fooled by high levels of homoplasy (‘same’ events). • Can become positively misleading in the “Felsenstein Zone”: • [See Stewart (1993) for a simple explanation of parsimony analysis, and Swofford • et al. (1996) for a detailed explanation of various parsimony methods.] Based on lectures by C-B Stewart, and by Tal Pupko

  22. Branch and Bound Tal Pupko, Tel-Aviv University Based on lectures by C-B Stewart, and by Tal Pupko

  23. There are many trees.., We cannot go over all the trees. We will try to find a way to find the best tree. There are approximate solutions… But what if we want to make sure we find the global maximum. There is a way more efficient than just go over all possible tree. It is called BRANCH AND BOUND and is a general technique in computer science, that can be applied to phylogeny. Based on lectures by C-B Stewart, and by Tal Pupko

  24. BRANCH AND BOUND To exemplify the BRANCH AND BOUND (BNB) method, we will use an example not connected to evolution. Later, when the general BNB method is understood, we will see how to apply this method to finding the MP tree. We will present the traveling salespersonpath problem (TSP). Based on lectures by C-B Stewart, and by Tal Pupko

  25. THE TSP PROBLEM (especially adapted to israel). A guard has to visit n check-points whose location on a map is known. The problem is to find the shortest path that goes through all points exactly once (no need to come back to starting point). Naïve approach: (say for 5 points). You have 5 starting points. For each such starting point you have 4 “next steps”. For each such combination of starting point and first step, you have 3 possible second steps, etc. All together we have 5*4*3*2*1 Possible solutions = 5! . Based on lectures by C-B Stewart, and by Tal Pupko

  26. 1 1 1 1 2 2 3 2 3 2 4 4 3 4 3 5 5 5 5 4 THE TSP TREE 1 2 3 4 5 1 4 5 1 2 5 1 2 4 2 4 5 4 5 2 5 2 4 5 4 5 2 4 2 Based on lectures by C-B Stewart, and by Tal Pupko

  27. THE SHP NAÏVE APPROACH Each solution can be represented as a permutation: (1,2,3,4,5) (1,2,3,5,4) (1,2,4,3,5) (1,2,4,5,3) (1,2,5,3,4) … We can go over the list and find the one giving the highest score. Based on lectures by C-B Stewart, and by Tal Pupko

  28. THE SHP NAÏVE APPROACH However, for 15 points, for example, there are 1,307,674,368,000 The rate of increase of the number of solutions is too fast for this to be practical. Based on lectures by C-B Stewart, and by Tal Pupko

  29. A TSP GREEDY HEURISTIC Start from a random point. Go to the closest point. Go to its closest point, etc.etc. This approach doesn’t work so well… (but a reasonably close heuristic, based on simulated annealing, will be presented in a couple of lectures.) Based on lectures by C-B Stewart, and by Tal Pupko

  30. 1 1 1 2 2 2 3 3 3 4 4 3 5 5 5 4 BNB SOLUTION TO SHP 1 2 3 4 5 1 2 4 5 Score here already 16: no point in expanding the rest of the subtree Shortest path found so far = 15 1 4 5 1 2 5 1 2 4 2 4 5 4 5 2 5 2 4 5 4 5 2 4 2 Based on lectures by C-B Stewart, and by Tal Pupko

  31. Back to finding the MP tree Finding the MP tree is NP-Hard (will see shortly)… BNB helps, though it is still exponential… Based on lectures by C-B Stewart, and by Tal Pupko

  32. 1 1 1 1 3 3 3 3 2 2 2 2 4 The MP search tree 4 is added to branch 1. 4 4 5 is added to branch 2. There are 5 branches Based on lectures by C-B Stewart, and by Tal Pupko

  33. The MP search tree 30 4 is added to branch 1. 43 39 55 52 54 52 53 58 61 56 59 61 69 53 51 42 47 47 Based on lectures by C-B Stewart, and by Tal Pupko

  34. MP-BNB 30 4 is added to branch 1. 43 39 55 52 54 52 53 58 61 56 59 61 69 53 51 42 47 47 Best (minimum) value = 52 Based on lectures by C-B Stewart, and by Tal Pupko

  35. MP-BNB 30 4 is added to branch 1. 43 39 55 52 54 52 53 58 61 56 59 61 69 53 51 42 47 47 Best record = 52 Based on lectures by C-B Stewart, and by Tal Pupko

  36. MP-BNB 30 4 is added to branch 1. 43 39 55 52 54 52 53 58 61 56 59 61 69 53 51 42 47 47 Best record = 52 Based on lectures by C-B Stewart, and by Tal Pupko

  37. MP-BNB 30 43 39 55 52 54 52 53 58 53 51 42 47 47 Best record = 52 Based on lectures by C-B Stewart, and by Tal Pupko

  38. MP-BNB 30 43 39 55 52 54 52 53 58 53 51 42 47 47 Best record = 52 Based on lectures by C-B Stewart, and by Tal Pupko

  39. MP-BNB 30 43 39 55 52 54 52 53 53 58 58 53 51 42 47 47 Best record = 52 51 Based on lectures by C-B Stewart, and by Tal Pupko

  40. MP-BNB 30 43 39 55 52 54 52 53 58 53 51 42 47 47 Best record = 52 51 42 Based on lectures by C-B Stewart, and by Tal Pupko

  41. MP-BNB 30 43 39 55 52 54 52 53 58 53 51 42 47 47 Best record = 52 51 42 Based on lectures by C-B Stewart, and by Tal Pupko

  42. MP-BNB 30 43 39 55 52 54 52 53 58 53 51 42 47 47 Best record = 52 51 42 Based on lectures by C-B Stewart, and by Tal Pupko

  43. MP-BNB 30 43 39 55 52 54 52 53 58 53 51 42 47 47 Best TREE. MP score = 42 Total # trees visited: 14 Based on lectures by C-B Stewart, and by Tal Pupko

  44. Order of Evaluation Matters The bound after searching this subtree will be 42. 30 Evaluate all 3 first 43 39 55 53 51 42 47 47 Total tree visited: 9 Based on lectures by C-B Stewart, and by Tal Pupko

  45. And NowMaximum Parsimony is Computationally IntractableFelsenstein’s Dynamic Programming Algorithm for tiny maximum likelihoodand more, time permitting Based on lectures by C-B Stewart, and by Tal Pupko

More Related