
Evolution (1st lecture)

Finding Elements in DNA Conserved by Evolution. Characterization of Evolutionary Rates and Constraints in Three Mammalian Genomes (Gregory Cooper et al.). Identification and Characterization of Multi-Species Conserved Sequences (Elliott Margulies et al.).


Presentation Transcript


  1. Evolution (1st lecture)

  2. Finding Elements in DNA Conserved by Evolution. Characterization of Evolutionary Rates and Constraints in Three Mammalian Genomes (Gregory Cooper et al.). Identification and Characterization of Multi-Species Conserved Sequences (Elliott Margulies et al.). Presented by Penka Markova

  3. Finding Elements in DNA Conserved by Evolution. Premise: highly conserved sequences are more likely to reflect regions under active selection due to the presence of one or more elements that confer biological function. Involves comparative analysis; requires multiple alignments.

  4. Outline. First paper: Characterization of Evolutionary Rates and Constraints in Three Mammalian Genomes • Overview • Data • Global Patterns of Nucleotide Substitution • Rates of Transitions and Transversions in the Rodents • Rates of Neutral Point Substitution • Rates of Microinsertion and Microdeletion • Global Identification of Constrained Elements • Regional Variability of Evolutionary Parameters. Second paper: Identification and Characterization of Multi-Species Conserved Sequences • Overview • Data • Binomial, Parsimony and Intersecting Methods • Stats • Characteristics of the detected MCSs, conclusions

  5. 1st Paper: Characterization of Evolutionary Rates and Constraints in Three Mammalian Genomes. Gregory Cooper, Michael Brudno, Eric Stone, Inna Dubchak, Serafim Batzoglou, and Arend Sidow

  6. Overview. Goal: comparative analysis of the rat, mouse, and human genomes • facilitate insights into basic mechanisms of nucleotide evolution • facilitate the discovery of elements in the genome that play a functional role in human biology (by leveraging the fact that functional DNA is constrained by purifying selection). Summary: provides an analysis of the rates and patterns of microevolutionary phenomena that have shaped the human, mouse, and rat genomes since their last common ancestor • evidence for a shift in the mutational spectrum between the mouse and rat lineages (increase of CG content in the rat genome) • support for the idea that rates of evolution are influenced by local genomic or cell-biological context • no correlation between rates of point substitution and rates of microindels (the influences that affect these processes are distinct) • identified the regions in the human genome that are evolving slowly (likely to include functional elements important to human biology)

  7. Data. Three complete mammalian genome sequences • human, rat, mouse • new: the rat genome. Multiple alignment • MLAGAN. Two datasets • Dataset1: all sites that are confidently aligned among all three sequences (most included positions originated prior to the last common ancestor) • Dataset2: “rodent-specific neutral sites”, containing only sites present in the rodents (heavily enriched for neutrally evolving sites)


  9. Global Patterns of Nucleotide Substitution. Global shift in the mutation spectra between mouse and rat • rat has 0.35 percentage points more CG than mouse (41.61% vs 41.26%), a statistically highly significant difference • CpG dinucleotides: 0.92% in the mouse, 1.06% in the rat (the remaining nucleotides show smaller differences). Consistent bias toward elevated CG in the rat genome • does not appear to be confined to particular types of transitions or transversions • based on a quantitative analysis of Dataset1 (117 million positions with a single difference in either rodent). The causative factors for the shift, selective or otherwise, remain to be elucidated.
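The CG and CpG percentages quoted on this slide are simple base-composition statistics. A minimal sketch of how such fractions could be computed from a sequence string follows; the function names and the toy sequence are hypothetical and are not taken from the paper.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def cpg_fraction(seq: str) -> float:
    """Fraction of adjacent base pairs (dinucleotide positions) that read CG."""
    seq = seq.upper()
    pairs = len(seq) - 1
    if pairs <= 0:
        return 0.0
    hits = sum(1 for i in range(pairs) if seq[i:i + 2] == "CG")
    return hits / pairs


if __name__ == "__main__":
    toy = "ACGCGT"  # hypothetical toy sequence, not real genome data
    print(gc_fraction(toy), cpg_fraction(toy))
```

Whether the slide's CpG percentages are normalized per dinucleotide position, as assumed here, is not stated in the transcript.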

  10. Rates of Transitions and Transversions in the Rodents. Transitions are approximately fourfold more likely than any transversion. Useful for molecular evolutionary studies (most methods of phylogenetic inference model point substitutions on the basis of stationary Markov processes and require user-specified substitution parameters).
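As a reminder of the distinction, a transition swaps a purine for a purine (A, G) or a pyrimidine for a pyrimidine (C, T), while a transversion swaps between the two classes. The sketch below counts both kinds of difference in a gapped pairwise alignment; the function names and example strings are hypothetical, and this is an illustration rather than the paper's estimation procedure.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
VALID = PURINES | PYRIMIDINES


def classify(a: str, b: str) -> str:
    """Classify an aligned base pair as match, transition, or transversion."""
    if a == b:
        return "match"
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def count_ts_tv(seq1: str, seq2: str) -> tuple[int, int]:
    """Count transitions and transversions, skipping gaps and ambiguous bases."""
    ts = tv = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in VALID or b not in VALID:
            continue
        kind = classify(a, b)
        if kind == "transition":
            ts += 1
        elif kind == "transversion":
            tv += 1
    return ts, tv


if __name__ == "__main__":
    print(count_ts_tv("AACCGT-T", "AGCTGA-C"))
```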

  11. Rates of Neutral Point Substitution. Point substitution events in rodent-specific neutral sites (Dataset2). Neutral rate for the evolutionary tree relating the three species • relative branch lengths of the tree: based on Dataset1 positions without a gap in any sequence • normalized (the rat branch is 1 unit length)

  12. Rates of Microinsertion and Microdeletion. Definition: lesions no larger than 10 bp. Dataset1 • gaps of size 10 bp or less. Rapid decline in the relative number of indel events as size increases.
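The size distribution of microindels can be tabulated by counting runs of gap characters in an aligned sequence. The sketch below assumes gaps are written as "-" and uses the 10 bp cutoff from the definition on this slide; the function name and the example string are hypothetical.

```python
import re
from collections import Counter


def gap_length_counts(aligned_seq: str, max_len: int = 10) -> Counter:
    """Count gap runs by length, keeping only runs of at most max_len."""
    runs = re.findall(r"-+", aligned_seq)
    return Counter(len(run) for run in runs if len(run) <= max_len)


if __name__ == "__main__":
    # Hypothetical aligned row: gaps of length 2, 1, 3, and one 12-base gap.
    row = "AC--GT-A---C" + "-" * 12 + "G"
    for size, count in sorted(gap_length_counts(row).items()):
        print(size, count)
```

The 12-base run is excluded because it exceeds the cutoff. Deciding whether a gap is an insertion or a deletion requires an outgroup and is not attempted here.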

  13. Global Identification of Constrained Elements. Annotated all the regions in the human genome that are evolving, on average, significantly slower than the neutral rate • sequences that function in organismal biology tend to be under purifying selection and thus manifest themselves as slowly evolving regions • 210,923 constrained elements (>51 bp)

  14. Global Identification of Constrained Elements

  15. Regional Variability of Evolutionary Parameters • substantially stable microevolutionary pressures (modest-to-strong correlations between rates of microdeletion [A, B]) • local evolutionary pressures appear to influence point substitutions and microindels differently (variation in the rate of microinsertion/microdeletion does not correlate well with point substitution) • local genomic context influences the rate of point substitution regardless of the type of site (correlation of the neutral rate with the rate of substitution [B]) • CG content correlates with rates of point substitution. Sliding-window analysis along rat chromosome 1, window width of 2 Mb.
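A sliding-window analysis of this kind can be sketched as follows: average a per-position quantity over windows, then correlate two such window series. The slide gives a 2 Mb window width but no step size, so the step, the function names, and the toy numbers below are all assumptions made for illustration.

```python
import math


def window_means(values: list[float], width: int, step: int) -> list[float]:
    """Mean of `values` in each full window of `width`, advancing by `step`."""
    return [
        sum(values[start:start + width]) / width
        for start in range(0, len(values) - width + 1, step)
    ]


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-position rates; real data would span whole chromosomes.
    substitution = [1, 2, 3, 4, 5, 6]
    indel = [6, 5, 4, 3, 2, 1]
    a = window_means(substitution, width=2, step=2)
    b = window_means(indel, width=2, step=2)
    print(a, b, pearson(a, b))
```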


  17. 2nd Paper: Identification and Characterization of Multi-Species Conserved Sequences. Elliott Margulies, Mathieu Blanchette, NISC Comparative Sequencing Program, David Haussler, Eric Green

  18. Overview. Goals: identify highly conserved DNA regions, in particular “Multi-species Conserved Sequences” (MCSs), in a robust fashion • useful in comparative sequence analysis aiming to elucidate genome function. Evaluate the relative contribution of different species’ sequences to identifying genomic regions of interest • one of the criteria considered in choosing additional species for whole-genome sequencing. Summary of results: proposes two strategies for MCS identification (binomial, parsimony) • these detect virtually all known actively conserved sequences (coding sequence) but very little neutrally evolving sequence (ancestral repeats). Analysis of the features of detected MCSs. Currently available genome sequences are insufficient for comprehensive identification of MCSs in the human genome.

  19. Data. Sequences of human and 11 non-human vertebrates • 2 primates (chimpanzee, baboon), 2 carnivores (cat, dog), 2 artiodactyls (cow, pig), 2 rodents (mouse, rat), 1 bird (chicken), 2 fish (fugu, tetraodon) • orthologous to a 1.8-Mb region on human chromosome 7q31. Multiple alignment • human-referenced pairwise alignments • RepeatMasker, BLASTZ. Systematically annotated for known coding exons, UTRs, and ancestral repeats (ARs).

  20. Algorithms: Binomial, Parsimony, Intersecting. Take into account: the phylogenetic diversity of the aligned species’ sequences; the varying neutral substitution rate; the characteristics of the available genomic multi-sequence alignment, especially sparse alignments. Requirements: sufficiently large branch length of the phylogenetic tree (non-functional regions should be sufficiently diverged); greater total branch length (compared to the length required for identification of larger functional elements); a good multiple alignment is crucial.

  21. Algorithms: Binomial. Binomial-Based Method for MCS Detection. Calculates the conservation score based on the probability of detecting the observed amount of conservation between the human and each other species’ sequence, assuming the neutral substitution rate. The neutral substitution rate is calculated from fourfold degenerate positions (the third base of codons for which any base will encode the same amino acid). Normalizes for phylogenetic biases by averaging. The final conservation score is calculated from overlapping 25-base windows.

  22. Algorithms: Binomial. Notation: N = number of aligned bases in the 25-base window of the human-species j alignment; K = number of perfect matches; p_j = neutral substitution probability, i.e. the probability that a given base in the human sequence has been conserved in species j, assuming the neutral substitution rate between human and species j; K/N = baseline conservation level; C(j) = cumulative binomial probability of observing at least K matches in N bases, $C(j) = \sum_{k=K}^{N} \binom{N}{k} p_j^{k} (1 - p_j)^{N-k}$. Algorithm: 1) within all windows of 25 bases, for each species j [equation image not included in the transcript]. Example aligned window: CGGCTAAG…ACTGACTGGGT / CGACTGAG…ACTGACTGGGT
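The quantities N, K, and the cumulative binomial probability C(j) defined on this slide can be computed directly. The sketch below does so for the first eight aligned bases of the example window; the value used for p_j is a made-up placeholder, the function names are hypothetical, and how C(j) is turned into the per-species score s_j is not preserved in the transcript, so that step is not attempted.

```python
from math import comb


def count_matches(human: str, other: str) -> tuple[int, int]:
    """Return (N, K): aligned (ungapped) bases and perfect matches."""
    n = k = 0
    for a, b in zip(human, other):
        if a == "-" or b == "-":
            continue
        n += 1
        if a == b:
            k += 1
    return n, k


def binomial_tail(k: int, n: int, p: float) -> float:
    """Probability of at least k successes in n trials with success prob p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    n, k = count_matches("CGGCTAAG", "CGACTGAG")  # prefix of the slide example
    p_j = 0.5  # hypothetical neutral conservation probability for species j
    print(n, k, binomial_tail(k, n, p_j))
```

A small tail probability means the observed number of matches would be surprising under neutral evolution, which is the intuition behind the conservation score.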

  23. Algorithms: Binomial. Algorithm: 2) “phylogenetically average” the individual species’ scores s_j to obtain the final conservation score for the window 3) the final score assigned to position i is [equation image not included in the transcript] 4) for a given threshold t, position i is predicted to be part of an MCS if [equation image not included in the transcript]

  24. Algorithms: Binomial. Binomial-Based Method: Conclusion. Conservation scores below zero represent alignable regions that are less conserved than expected; scores above zero represent regions that are more conserved than expected. Minimum MCS length is 25 bases. Sequence conservation detected with more diverged species (with higher neutral substitution rates) is weighted more heavily. Measures conservation with respect to one reference sequence only.

  25. Algorithms: Parsimony. Parsimony-Based Method. The amount of conservation within each column of the alignment is measured using a phylogenetic parsimony score P(i) • P(i) reflects the minimal number of substitutions needed along the branches of an established phylogenetic tree to account for the observed bases at the leaves of the tree. Based on P(i), calculates a score under a continuous-time Markov model of neutral evolution, measuring the “surprise” of observing a parsimony score of P(i) or smaller. Requires a phylogenetic tree and a model of neutral substitution.

  26. Algorithms: Parsimony. Algorithm: 1) Calculate the parsimony score P(i) for the i-th position. P(i) = the minimum number of substitutions, performed along the branches of the tree, needed to explain the bases observed at the leaves of the tree • notice that P(i) is a tight lower bound on the number of substitutions that actually occurred at position i during evolution. 2.0) Define a model of neutral evolution • based on the phylogenetic tree T relating the species under study and a neutral substitution rate matrix Q • ℓ(e) denotes the length of branch e, r the root of the tree • transition probability matrix along a branch (u,v): $M(u,v) = e^{\ell(u,v)\,Q}$ • background base distribution π. This model generates a set of random but related bases at the leaves of the tree by simulating evolution.
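One standard way to obtain the minimum number of substitutions for a single alignment column on a rooted binary tree is Fitch's algorithm, sketched below. The transcript does not say which algorithm the authors used, so this is an illustration of the definition of P(i) only; the tree topology, species names, and function names are hypothetical.

```python
def fitch(tree, bases):
    """Return (candidate base set, substitution count) for a subtree.

    `tree` is either a leaf name (str) or a pair (left, right) of subtrees;
    `bases` maps each leaf name to the base observed in this column.
    """
    if isinstance(tree, str):
        return {bases[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, bases)
    right_set, right_cost = fitch(right, bases)
    common = left_set & right_set
    if common:
        # The children can agree on a base: no extra substitution needed.
        return common, left_cost + right_cost
    # The children disagree: one substitution is charged at this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, bases) -> int:
    """Minimum number of substitutions explaining the leaf bases."""
    return fitch(tree, bases)[1]


if __name__ == "__main__":
    # Hypothetical topology for illustration only.
    tree = (("human", "chimp"), ("mouse", "rat"))
    column = {"human": "A", "chimp": "A", "mouse": "G", "rat": "G"}
    print(parsimony_score(tree, column))
```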

  27. Algorithms: Parsimony. 2) Define the score assigned to position i based on the 25-base window as [equation image not included in the transcript] • Z(r) is the random variable describing the parsimony score of the bases of the subtree rooted at r • $\Pr[Z(r) \le P(j)]$ is the probability that the parsimony score of the bases at the leaves of T generated by the model defined above is at most P(j) • calculated using a dynamic programming algorithm proceeding from the leaves of T to its root • if this probability is small, the position is unlikely to have been generated under neutral evolution
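The slide computes the probability that the neutral model yields a parsimony score of at most P(j) exactly, by dynamic programming. As a rough illustration of what that probability means, the sketch below estimates it by Monte Carlo simulation instead, which is a different and less precise technique. It assumes a Jukes-Cantor substitution model with a uniform background distribution in place of the unspecified Q and π, and the tree, branch lengths, and function names are hypothetical.

```python
import math
import random

BASES = "ACGT"


def jc69_same_prob(length: float) -> float:
    """Jukes-Cantor probability that a base is unchanged along a branch."""
    return 0.25 + 0.75 * math.exp(-4.0 * length / 3.0)


def evolve(base: str, length: float, rng: random.Random) -> str:
    """Sample the base at the lower end of a branch of the given length."""
    if rng.random() < jc69_same_prob(length):
        return base
    return rng.choice([b for b in BASES if b != base])


def simulate_and_score(node, base, rng):
    """Evolve `base` into `node`, then return the Fitch (set, cost) below it.

    A node is (content, branch_length); content is a leaf name (str) or a
    pair of child nodes.
    """
    content, length = node
    base = evolve(base, length, rng)
    if isinstance(content, str):
        return {base}, 0
    left_set, left_cost = simulate_and_score(content[0], base, rng)
    right_set, right_cost = simulate_and_score(content[1], base, rng)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def estimate_tail(tree, observed: int, trials: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of Pr[simulated parsimony score <= observed]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        root_base = rng.choice(BASES)  # uniform background distribution
        _, cost = simulate_and_score(tree, root_base, rng)
        hits += cost <= observed
    return hits / trials


if __name__ == "__main__":
    # Hypothetical tree with made-up branch lengths.
    tree = (((("human", 0.1), ("chimp", 0.1)), 0.3),
            ((("mouse", 0.4), ("rat", 0.4)), 0.3))
    root = (tree, 0.0)
    print(estimate_tail(root, observed=0))
```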

  28. Algorithms: Parsimony. 3) the final score assigned to position i is [equation image not included in the transcript] 4) for a given threshold t, position i is predicted to be part of an MCS if [equation image not included in the transcript]. Parsimony-Based Method: Conclusion • requires a phylogenetic tree and a model of neutral substitution • produces higher scores based on conservation across large phylogenetic distances

  29. Algorithms: Binomial, Parsimony, Intersecting. Intersecting Method • intersects the results from the Binomial and Parsimony methods • MCSs can be shorter than 25 bp. Observations: all three methods are biased towards the identification of sequences that are conserved in most species (as opposed to only a subset of species). The conservation score threshold was selected such that 5% of the human sequence from the analyzed region falls within an MCS (5% of the human genome is considered to be under active selection).
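The threshold choice and the intersecting method can be sketched as follows: pick the score cutoff that places about 5% of positions inside predicted MCSs, then keep only positions called by both methods. The exact MCS condition is not preserved in the transcript, so the rule "score at or above the threshold" is an assumption, as are the function names and the toy scores.

```python
def threshold_for_fraction(scores: list[float], fraction: float = 0.05) -> float:
    """Score cutoff such that about `fraction` of positions score >= cutoff."""
    ranked = sorted(scores, reverse=True)
    keep = max(1, round(len(ranked) * fraction))
    return ranked[keep - 1]


def predict(scores: list[float], threshold: float) -> list[bool]:
    """Assumed rule: a position is called conserved if its score >= threshold."""
    return [score >= threshold for score in scores]


def intersect(calls_a: list[bool], calls_b: list[bool]) -> list[bool]:
    """Keep only positions called by both methods."""
    return [a and b for a, b in zip(calls_a, calls_b)]


if __name__ == "__main__":
    # Hypothetical per-position scores from two methods.
    binomial_scores = [float(i) for i in range(100)]
    parsimony_scores = [float(99 - i) if i < 97 else 500.0 for i in range(100)]
    b_calls = predict(binomial_scores, threshold_for_fraction(binomial_scores))
    p_calls = predict(parsimony_scores, threshold_for_fraction(parsimony_scores))
    print(sum(b_calls), sum(p_calls), sum(intersect(b_calls, p_calls)))
```

Because the intersection is taken position by position, runs of agreeing positions can be shorter than the 25-base window, consistent with the slide's note that intersected MCSs can be shorter than 25 bp. Tied scores at the cutoff can push the called fraction slightly above the target.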

  30. Concordance of the binomial- and parsimony- based methods for MCS detection

  31. Results: discrimination of different types of sequence using conservation scores

  32. Results. General features of detected MCSs • detected virtually all known actively conserved sequences (coding sequence), but very little neutrally evolving sequence (ancestral repeats) • the majority of sequences conserved across multiple vertebrate species have no known function (70% of MCSs reside in non-coding regions) • uniqueness of the MCSs in the human genome. Correlating MCSs with functional elements • MCSs correspond to clusters of transcription factor-binding sites, non-coding RNA transcripts, and other candidate functional elements

  33. Results: characteristics of the detected MCSs

  34. Positions of MCSs relative to other annotated genomic features (representative region)

  35. Results Contribution of different species’ sequences to the detection of MCSs • Rodent sequences detect the greatest number of MCS bases, largest number of non-coding sequence • Chicken sequence has considerably higher specificity, largest amount of coding MCS bases • MCSs detected with fish sequences almost exclusively contain coding sequence • Non-human primate sequences are not useful with the applied methods • None of the individual species’ sequences alone came close to identifying all the reference MCS bases • Currently available genome sequences are insufficient for comprehensive identification of MCSs in the human genome

  36. Ability of individual & combinations of species’ sequences to detect MCSs


  38. The end

  39. A(u) is the random variable representing the base generated by this random process at node u.
