
Molecular Phylogeny and Evolution



Presentation Transcript


  1. Molecular Phylogeny and Evolution CISC 4020 Bioinformatics Spring 2012 Department of Computer and Information Science

  2. Outline
     • Introduction to Evolution and Phylogeny
     • Phylogenetic Tree
     • Five stages of phylogenetic analysis

  3. Evolution
     • Charles Darwin’s 1859 book (On the Origin of Species By Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life) introduced the theory of evolution.
     • Groups of organisms change over time so that descendants differ structurally and functionally from their ancestors.

  4. Natural Selection
     • To Darwin, the struggle for existence induces a natural selection. Offspring are dissimilar from their parents (that is, variability exists), and individuals that are more fit for a given environment are selected for. In this way, over long periods of time, species evolve.

  5. Molecular Evolution
     • The study of changes in genes and proteins throughout different branches of the tree of life.
     • At the molecular level, evolution is a process of mutation with selection.
     • Data from present-day organisms are studied to reconstruct the evolutionary history of species.

  6. Phylogeny
     • The inference of evolutionary relationships.
     • Traditionally, phylogeny relied on the comparison of morphological features between organisms.
     • Today, molecular sequence data are also used for phylogenetic analyses.

  7. Molecular Phylogeny
     • The study of the evolutionary relationships among organisms or among molecules using the techniques of molecular biology.
     • A true tree depicts the actual, historical events that occurred in evolution; it is impossible to generate such a tree.
     • We generate inferred trees, which depict a hypothesized version of the historical events, with the help of Multiple Sequence Alignments (MSA) of protein or DNA/RNA.

  8. Goals of molecular phylogeny
     • One object of molecular phylogeny is to deduce the correct trees for all species of life.
     • Analyzing molecular sequence data that define families of genes and proteins.
     • Another object is to infer or estimate the time of divergence between organisms since the time they last shared a common ancestor.

  9. Molecular clock hypothesis
     • The hypothesis of a molecular clock: for every given gene or protein, the rate of molecular evolution is approximately constant in all evolutionary lineages.
     • The average rates of change are distinctly different for each protein family.

  10. Molecular clock hypothesis
     • Implications: If protein sequences evolve at constant rates, they can be used to estimate the times at which sequences diverged. This is analogous to dating geological specimens by radioactive decay.
     • Examples of estimated divergence times:
       Beta and delta globins: 44 MYA
       Beta and gamma globins: 260 MYA
       Alpha and beta globins: 565 MYA
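The clock-based dating described on this slide can be sketched as a one-line calculation. This is a minimal sketch that assumes a strict clock; the function name `divergence_time` and the input values are hypothetical and are not taken from the slides.

```python
def divergence_time(k, rate):
    """Estimate time since divergence under a strict molecular clock.

    k    -- substitutions per site separating the two sequences
    rate -- substitutions per site per year along a single lineage
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    # Both lineages accumulate changes after the split, hence the factor 2.
    return k / (2.0 * rate)


# Hypothetical numbers, chosen only to illustrate the arithmetic.
t_years = divergence_time(k=0.1, rate=1e-9)
print(f"about {t_years / 1e6:.0f} million years")
```

The factor of 2 reflects the assumption that the two sequences have been diverging independently since their common ancestor.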

  11. Positive and negative selection
     • Darwin’s theory of evolution suggests that, at the phenotypic level, traits in a population that enhance survival are selected for, while traits that reduce fitness are selected against.
     • For example, among a group of giraffes millions of years in the past, those giraffes that had longer necks were able to reach higher foliage and were more reproductively successful than their shorter-necked group members; that is, the taller giraffes were selected for.

  12. Positive and negative selection
     • In the mid-20th century, a conventional view was that molecular sequences are routinely subject to positive (or negative) selection.
     • Positive selection occurs when a sequence undergoes significantly increased rates of substitution, while negative selection occurs when a sequence undergoes change slowly. Otherwise, selection is neutral.

  13. Neutral theory of evolution
     • An often-held view of evolution is that just as organisms propagate through natural selection, so also DNA and protein molecules are selected for.
     • According to Motoo Kimura’s 1968 neutral theory of molecular evolution, the vast majority of DNA changes are not selected for in a Darwinian sense. The main cause of evolutionary change is random drift of mutant alleles that are selectively neutral (or nearly neutral). Positive Darwinian selection does occur, but it has a limited role.

  14. Neutral theory of evolution
     • The existence of a molecular clock makes sense in the context of the neutral hypothesis because most amino acid substitutions are neutral.
     • Substitutions that are tolerated by natural selection can accumulate in a manner that has clock-like properties.
     • If substitutions occurred primarily in the context of positive or negative selection, it is unlikely that they could account for clock-like evolution.

  15. Outline
     • Introduction to Evolution and Phylogeny
     • Phylogenetic Tree
     • Five stages of phylogenetic analysis

  16. Phylogenetic Tree
     • The technique of molecular biology for studying evolutionary relationships among organisms using molecular sequence data (DNA or protein).
     • A phylogenetic tree is a graph composed of branches and nodes.
     [Figure: example tree with OTUs A to E, internal nodes F to I, numeric branch lengths, and a time axis]

  17. Tree nomenclature
     • Node: the intersection or terminating point of two or more branches.
     • Branch: an edge.
     [Figure: the example tree, shown with a time axis and with branch lengths drawn to scale (scale bar: one unit)]

  18. Node of Tree - Taxon
     • Taxon: a taxonomic category or group, such as a family or a species.
     [Figure: the example tree with taxa labeled]

  19. Leaf Node of Tree - OTU
     • Operational taxonomic unit (OTU): an extant taxon, such as a protein sequence that we analyze.
     [Figure: the example tree]

  20. Internal Node of Tree
     • An inferred ancestor of the OTUs.
     [Figure: the example tree]

  21. Branch of Tree
     • Branches are unscaled: OTUs are neatly aligned, and nodes reflect time.
     • Branches are scaled: branch lengths are proportional to the number of amino acid changes.
     [Figure: the example tree drawn both ways]

  22. Branch of Tree
     • Bifurcating internal node
     • Multifurcating internal node
     [Figure: the example tree with both kinds of internal node labeled]

  23. Examples of multifurcation: failure to resolve the branching order of some metazoans and protostomes
     Rokas A. et al., Animal Evolution and the Molecular Signature of Radiations Compressed in Time, Science 310:1933 (2005), Fig. 1.

  24. Tree nomenclature: clades
     • Clade ABF (monophyletic group): a common ancestor and all of its descendants.
     [Figure: the example tree with OTUs A to E, internal nodes F to I, and a time axis]

  25. Tree nomenclature
     • Clade CDH
     [Figure: the example tree]

  26. Tree nomenclature
     • Clade ABF/CDH/G
     [Figure: the example tree]
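The clades named on these slides can be reproduced with a small sketch. The topology below is an assumption read off the slide labels (I as root, G, F, and H as internal nodes, A to E as OTUs); the slides do not state it explicitly, and the names `CHILDREN`, `clade_leaves`, and `is_monophyletic` are hypothetical.

```python
# Assumed topology: I is the root; F joins A and B, H joins C and D,
# G joins F and H, and E branches directly from the root.
CHILDREN = {
    "I": ["G", "E"],
    "G": ["F", "H"],
    "F": ["A", "B"],
    "H": ["C", "D"],
}


def clade_leaves(node, children=CHILDREN):
    """Return the sorted OTUs (leaves) descended from the given node."""
    kids = children.get(node, [])
    if not kids:
        return [node]  # a leaf is its own only descendant OTU
    leaves = []
    for kid in kids:
        leaves.extend(clade_leaves(kid, children))
    return sorted(leaves)


def is_monophyletic(otus, children=CHILDREN):
    """True if some node's descendant OTUs are exactly the given set."""
    target = sorted(otus)
    nodes = set(children) | {k for kids in children.values() for k in kids}
    return any(clade_leaves(n, children) == target for n in nodes)


print(clade_leaves("F"))            # OTUs in clade ABF
print(is_monophyletic(["B", "C"]))  # B and C alone do not form a clade
```

Under this assumed topology, a group is monophyletic only when it contains every OTU below a single internal node.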

  27. Examples of clades
     Lindblad-Toh et al., Nature 438: 803 (2005), fig. 10

  28. Tree roots
     • The root of a phylogenetic tree represents the common ancestor of the sequences.
     • Some trees are unrooted, and thus do not specify the common ancestor.
     • A tree can be rooted using an outgroup (that is, a taxon known to be distantly related to all other OTUs).

  29. Tree nomenclature: roots
     • Rooted tree: specifies the evolutionary path.
     • Unrooted tree: the direction of time is undetermined.
     [Figure: a rooted tree drawn on a past-to-present axis and an unrooted tree, both with numbered nodes]

  30. Tree nomenclature: outgroup rooting
     • Outgroup: used to place the root.
     [Figure: an unrooted tree and the corresponding rooted tree with numbered nodes on a past-to-present axis. The root is marked, and the labels read "A homologous bacterial protein" and "5 human being myoglobin orthologs".]

  31. Numbers of possible trees: extremely large for >10 sequences

     Number of OTUs   Number of rooted trees   Number of unrooted trees
     2                1                        1
     3                3                        1
     4                15                       3
     5                105                      15
     10               34,459,425               2,027,025
     20               8 x 10^21                2 x 10^20
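The counts on this slide follow the standard formulas for bifurcating trees: (2n-3)!! rooted trees and (2n-5)!! unrooted trees for n OTUs. A minimal sketch, with hypothetical function names:

```python
def double_factorial_odd(m):
    """Product m * (m - 2) * ... * 3 * 1; returns 1 when m is below 2."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def rooted_trees(n):
    """Number of rooted bifurcating trees for n OTUs: (2n - 3)!!"""
    if n < 2:
        raise ValueError("need at least 2 OTUs")
    return double_factorial_odd(2 * n - 3)


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n OTUs: (2n - 5)!!"""
    if n < 2:
        raise ValueError("need at least 2 OTUs")
    return double_factorial_odd(2 * n - 5)


for n in (2, 3, 4, 5, 10, 20):
    print(n, rooted_trees(n), unrooted_trees(n))
```

Because each added OTU multiplies the count by an ever larger odd number, exhaustive search over all trees quickly becomes impractical.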

  32. Outline
     • Introduction to Evolution and Phylogeny
     • Phylogenetic Tree
     • Five stages of phylogenetic analysis

  33. Five stages of phylogenetic analysis
     [1] Selection of sequences for analysis
     [2] Multiple sequence alignment
     [3] Selection of a substitution model
     [4] Tree building
     [5] Tree evaluation

  34. Stage 1: Use of DNA, RNA, or Protein
     • For phylogeny, DNA can be more informative.
     • The protein-coding portion of DNA has synonymous and nonsynonymous substitutions. Thus, some DNA changes do not have corresponding protein changes.
     • A synonymous substitution does not result in a change in the amino acid that is specified.
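The distinction between synonymous and nonsynonymous substitutions can be illustrated by translating two codons and comparing the amino acids. This sketch assumes the standard genetic code (bases ordered T, C, A, G); the name `classify_codon_change` is hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(codon_a, codon_b):
    """Label a codon pair as identical, synonymous, or nonsynonymous."""
    a, b = codon_a.upper(), codon_b.upper()
    if a == b:
        return "identical"
    if CODON_TABLE[a] == CODON_TABLE[b]:
        return "synonymous"  # DNA changed, amino acid did not
    return "nonsynonymous"   # DNA change alters the amino acid


print(classify_codon_change("GAA", "GAG"))  # both codons encode Glu
print(classify_codon_change("GAA", "GAC"))  # Glu versus Asp
```

A synonymous change is visible in a DNA alignment but invisible in the corresponding protein alignment, which is one reason DNA can carry more information.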

  35. Stage 1: Use of DNA, RNA, or protein
     • For phylogeny, DNA can be more informative.
     • Some substitutions in a DNA sequence alignment can be directly observed: single nucleotide substitutions, sequential substitutions, coincidental substitutions.

  36. Substitutions in a DNA sequence alignment can be directly observed, or inferred


  38. Stage 1: Use of DNA, RNA, or protein
     • For phylogeny, DNA can be more informative.
     • Noncoding regions (such as 5’ and 3’ untranslated regions) may be analyzed using molecular phylogeny.


  41. Stage 1: Use of DNA, RNA, or protein
     • For phylogeny, protein sequences are also often used.
     • Proteins have 20 states (amino acids) instead of only four for DNA, so there is a stronger phylogenetic signal.
     • Amino acid sequences are partially ordered character states: there is a variable number of states between the starting value and the final value. Nucleotides are unordered characters: any one nucleotide can change to any other in one step.

  42. Five stages of phylogenetic analysis
     [1] Selection of sequences for analysis
     [2] Multiple sequence alignment
     [3] Selection of a substitution model
     [4] Tree building
     [5] Tree evaluation

  43. Stage 2: Multiple sequence alignment
     • Confirm that all sequences are homologous.
     • Adjust gap creation and extension penalties as needed to optimize the alignment.
     • Restrict phylogenetic analysis to regions of the multiple sequence alignment for which data are available for all taxa (delete columns having incomplete data or gaps).
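The last step, deleting alignment columns that contain gaps, can be sketched as follows. This is a minimal illustration that represents the alignment as a list of equal-length strings; the function name `strip_gapped_columns` and the choice of gap characters are assumptions.

```python
def strip_gapped_columns(alignment, gap_chars="-."):
    """Keep only the alignment columns in which no sequence has a gap."""
    if not alignment:
        return []
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("aligned sequences must have equal length")
    # zip(*alignment) iterates over columns instead of rows.
    kept = [
        column
        for column in zip(*alignment)
        if not any(ch in gap_chars for ch in column)
    ]
    if not kept:
        return ["" for _ in alignment]
    # Transpose the surviving columns back into one string per sequence.
    return ["".join(chars) for chars in zip(*kept)]


msa = ["AC-GT", "ACTGT", "A-TGT"]
print(strip_gapped_columns(msa))
```

Dedicated alignment-trimming tools apply more nuanced criteria; this sketch implements only the strict "no gaps in any taxon" rule from the slide.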

  44. [Figure: multiple sequence alignment of globins. Open circles mark positions that distinguish myoglobins, alpha globins, and beta globins; other annotations mark 100% conserved positions and gaps.]

  45. Five stages of phylogenetic analysis
     [1] Selection of sequences for analysis
     [2] Multiple sequence alignment
     [3] Selection of a substitution model
     [4] Tree building
     [5] Tree evaluation

  46. Stage 3: Models of DNA and Amino Acid Substitution
     • The simplest approach to defining the relatedness of a group of nucleotide (or amino acid) sequences is to align pairs of sequences and then count the number of differences.
     • The count of differing sites is the Hamming distance. For an alignment of length N with n sites at which there are differences, the degree of divergence D is the Hamming distance divided by the length: D = n / N
     • But observed differences do not equal genetic distance! Genetic distance involves mutations that are not observed directly.
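The divergence D = n / N can be computed directly from a pair of aligned sequences. A minimal sketch, assuming the sequences are already aligned and gap-free; the function name `p_distance` is hypothetical.

```python
def p_distance(seq_a, seq_b):
    """Proportion of aligned sites that differ: D = n / N."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # n is the Hamming distance: the number of mismatched positions.
    n = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return n / len(seq_a)


print(p_distance("ACGTACGTAC", "ACGTTCGTAA"))  # 2 differences over 10 sites
```

This value only counts observed differences, so it underestimates the true number of substitutions when a site has changed more than once.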

  47. Stage 3: Models of DNA and Amino Acid Substitution
     • Jukes and Cantor (1969) proposed a corrective formula:
       $D = -\frac{3}{4} \ln\left(1 - \frac{4}{3} p\right)$
       where p is the proportion of residues that differ.
     • This model describes the probability that one nucleotide will change into another. It assumes that each residue is equally likely to change into any other (i.e., the rate of transversions equals the rate of transitions).
     • In practice, the transition rate is typically greater than the transversion rate.
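The Jukes-Cantor correction, D = -(3/4) ln(1 - (4/3) p), can be evaluated with a few lines of Python. This is a sketch for nucleotide data; the function name `jukes_cantor_distance` is hypothetical.

```python
import math


def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance for a proportion p of differing sites."""
    # The logarithm is undefined once p reaches 3/4, the value expected
    # for two unrelated (saturated) nucleotide sequences.
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - (4.0 / 3.0) * p)


for p in (0.05, 0.2, 0.5):
    print(p, round(jukes_cantor_distance(p), 4))
```

The corrected distance is always at least as large as p, and the gap widens as p grows, reflecting the increasing share of multiple substitutions at the same site.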

  48. Models of nucleotide substitution
     [Figure: the four nucleotides A, G, T, and C. Changes between A and G and between T and C are labeled as transitions; all other changes are labeled as transversions.]
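The transition/transversion distinction can be made concrete by classifying each mismatch in a pair of aligned DNA sequences. A minimal sketch; the names `substitution_type` and `transition_transversion_proportions` are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(x, y):
    """Classify a nucleotide pair as none, transition, or transversion."""
    x, y = x.upper(), y.upper()
    valid = PURINES | PYRIMIDINES
    if x not in valid or y not in valid:
        raise ValueError("expected A, C, G, or T")
    if x == y:
        return "none"
    # Purine <-> purine or pyrimidine <-> pyrimidine is a transition.
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def transition_transversion_proportions(seq_a, seq_b):
    """Return the proportions of sites showing transitions and transversions."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    kinds = [substitution_type(a, b) for a, b in zip(seq_a, seq_b)]
    n = len(kinds)
    return kinds.count("transition") / n, kinds.count("transversion") / n


print(substitution_type("A", "G"))
print(substitution_type("A", "C"))
```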

  49. Jukes and Cantor one-parameter model of nucleotide substitution (a = b)
     [Figure: the four nucleotides A, G, T, and C, with every substitution labeled with the same rate a]

  50. Kimura model of nucleotide substitution (assumes a ≠ b)
     • More weight is given to transversions, which tend to cause nonsynonymous changes in protein-coding regions.
     [Figure: the four nucleotides A, G, T, and C, with transitions (A and G, T and C) labeled a and transversions labeled b]
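The slide does not give a distance formula for the Kimura model. The standard Kimura two-parameter distance is d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the observed proportions of sites showing transitions and transversions. The sketch below assumes that formula; the symbols P and Q and the function name are not taken from the slides.

```python
import math


def kimura_2p_distance(P, Q):
    """Kimura two-parameter distance.

    P -- proportion of sites differing by a transition
    Q -- proportion of sites differing by a transversion
    """
    a = 1 - 2 * P - Q
    b = 1 - 2 * Q
    if a <= 0 or b <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


# Hypothetical proportions, chosen only to illustrate the calculation.
print(round(kimura_2p_distance(0.1, 0.1), 4))
```

Unlike the one-parameter model, this correction treats the two classes of change separately, so the same total proportion of differences can yield different distances depending on how it splits between transitions and transversions.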
