240 likes | 360 Vues
This lesson delves into the intricacies of phylogenetics, focusing on three primary methods: Distance, Parsimony, and Maximum Likelihood. We explore the UPGMA method for hierarchical clustering, analyzing how it combines closest pairs to construct phylogenetic trees. The discussion also covers the limitations of distance-based methods and introduces Fitch and Margoliash techniques for calculating branch lengths. Finally, we assess the Maximum Likelihood approach, emphasizing its reliance on mutation rates and probabilistic evaluations to build accurate phylogenetic trees. Whether you are new to bioinformatics or seeking a deeper understanding, this review encapsulates essential concepts and methodologies in phylogenetics.
E N D
Doug Raiford Lesson 10 Phylogeny (part III) Phylogenetics Part III
Review • Three methods • Have looked at two • Distance and parsimony • Leaves maximum likelihood Phylogenetics Part III
Review distance • UPGMA: hierarchical clustering • Start by finding closest two • Combine closest pair A B C D Phylogenetics Part III
Review parsimony • Looks at each column of an MSA and attempts to find a tree that describes • Builds a consensus tree AGCT AACT AACT AACT A or a G 0 if A 0 if A A A or a G 0 0 if A 1 if A 0 A A A G Phylogenetics Part III
UPGMA and varying rates • Averages distances to combined pair • Doesn’t accurately reflect branch lengths • Can lead to inaccurate trees A B C D Phylogenetics Part III
Other distance methods • Given a tree can calculate path through all branches for any given pair • Know all pair-wise distances (from matrix) • Fitch and Margoliash came up with a way to determine each of the branch lengths (unrooted only) X D XY: A,C,E,G,I A C F E B H G I Y Phylogenetics Part III
Pseudo code • Find two closest taxa • Find average distance from each of these to all other organisms Simultaneous equations Combine closest, and continue A C c a D f d g b B e E Phylogenetics Part III
Maximum likelihood • Similar to parsimony in that performed on each column of MSA • Given the known mutation rates… • All possible trees considered • Examined one column at a time • Probability instead of count • Maximize the probability of a tree match A C c a D f d g b B e E Phylogenetics Part III
Maximum likelihood • Can determine substitution rates from base composition • What substitution rates would result in the base composition staying the same • Overall rate = N/L=rate for single nucleotide • Equilibrium: A↓= C A+G A+T A • N = (A A)fA+(A C)fA+…(T T)fT • Simplifying assumptions • A C = C A • Proportion of each nucleotide stays the same Once again, Simultaneous Equations Phylogenetics Part III
Another view • Better seen in a matrix form • All rates sum to N • Also, A↓= C A+G A+T A Phylogenetics Part III
Given rates… • Tree generated for a column • Each branch represents a substitution • Have a rate • Rate is similar to a probability • In this case rate is not mutations per unit time • Derived from number of mutations divided by number of nucleotides • Can be thought of as “probability of a mutation at any given nucleotide” Rate || Probability Phylogenetics Part III
Probability of a tree • The probability (likelihood) that a tree is the result of the given set of mutations • Product rule: • Multiply the rates of each of the branches A C c a D f d g b B e E Phylogenetics Part III
Combined probability • Generate probability for each column for each tree • Combined probability is the sum of these probabilities atgccgca-actgccgcaggagatcaggactttcatgaatatcatcatgcgtggga-ttcag acctccatacgtgccccaggagatctggactttcacc---tggatcatgcgaccgtacctac t-atgg-t-cgtgccgcaggagatcaggactttca-gt--g-aatcatctgg-cgc--c-aa t--tcgt-ac-tgccccaggagatctggactttcaaa---ca-atcatgcgcc-g-tc-tat aattccgtacgtgccgcaggagatcaggactttcag-t--a-tatcatctgtc-ggc--tag Phylogenetics Part III
Adjusting distances • Might not a mutation occur and then revert? • Simple count would not catch • The higher the degree of mutation, the greater the probability • Distances would be slightly greater Must account for reversions Phylogenetics Part III
Jukes and Kantor • Increased distance based upon this probability • K: substitutions per site • p: fraction of nucleotides different between two sequences • .08→.0846 Phylogenetics Part III
Kimura two parameter • Jukes and Kantor assumed a single mutation rate • Transversions less likely than transitions • A and G are purines, T and C are pyrimidines • Mutations that stay within family are more likely • Crossing families called transversion • P: fraction of transitions • Q: fraction of transversions Phylogenetics Part III
When use which? • Page 247 Choose set of related sequences Is their strong sequence similarity? Max Parsimony Obtain MSA Yes Distance Medium similarity (clearly recognizable)? Yes Max Likelihood Phylogenetics Part III
Summary • Branch length related to but not equal to time • MSA’s central due to differing mutation rates • 3 approaches • Distance • Parsimony • Maximum likelihood Choose set of related sequences Is their strong sequence similarity? Max Parsimony Obtain MSA Yes Distance Medium similarity (clearly recognizable)? Yes Max Likelihood Phylogenetics Part III
For non-bioinformaticians • Learned a clustering technique • Hierarchical • Great for exploratory data analysis • Learned some tree growth properties • Exposed to some statistical analysis • Maximum likelihood • Relationship of rates to probabilities Phylogenetics Part III
DNA and proteins, • If do rna must take covariance into account Phylogenetics Part III
Gene duplication events • Paralogs can be exploited • Evolve in a coordinated way • Find an organism that is similar but split off before the gene duplication even took place Phylogenetics Part II
Can convert score to dist • D = -log(S) Phylogenetics Part II
Go back to distance • Did simple count divided by num • Can use alignment score (if normalized for length) • What if mutate and then mutate back • Jukes and Cantor • Kimura two param Phylogenetics Part III