Phylogeny – data mining by biologists. Molecular phylogenetics is using clustering techniques to discern relationships between different biological sequences. Understanding our relationships. Trees are like mobiles. The language of trees. Changes can occur.
Phylogeny – data mining by biologists • Molecular phylogenetics uses clustering techniques to discern relationships between different biological sequences
The why and what of natural selection • Variation exists at the DNA level: alleles • This variation is inexhaustible (something important to remember when looking at new genome sequences) • These differences are subjected to selection: • Changes in protein structure are typically unfavorable and as a result, selected against • However, some changes in structure/function are selected for: sickle cell anemia/malaria
Neutral Theory of Evolution - Kimura • The third position of a codon, or a nucleotide in a non-coding, non-regulatory region, is expected to be invisible to natural selection • Compare Fugu with humans: the most conserved sequences are the genes • http://www.sciencemag.org/cgi/content/full/297/5585/1301 • Synonymous substitutions and substitutions in pseudogenes (define) are thought to reflect the actual mutation rate operating within a genome (no selection) • Is this accurate?
Genetic drift • Random genetic drift is a stochastic process (by definition). • One aspect of genetic drift is the random nature of transmitting alleles from one generation to the next, given that only a fraction of all possible zygotes become mature adults. • Example: begin with equal frequencies of C and T at a given position; the next generation may show 60/40 in favor of C purely by chance, which gives C a greater chance of making it into the following generation
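The sampling effect described on this slide can be sketched as a small Wright-Fisher-style simulation. This is a minimal illustration rather than anything taken from the slides: the function name `simulate_drift`, the population size, and the number of generations are all assumptions chosen for the example, and no selection or mutation is modeled.

```python
import random

def simulate_drift(pop_size, generations, start_freq=0.5, seed=None):
    """Track the frequency of allele C under pure random sampling.

    Hypothetical helper: each generation, pop_size alleles are drawn
    independently from the previous generation's allele pool.
    """
    rng = random.Random(seed)
    freq = start_freq
    history = [freq]
    for _ in range(generations):
        # Count how many of the sampled alleles happen to be C.
        count_c = sum(1 for _ in range(pop_size) if rng.random() < freq)
        freq = count_c / pop_size
        history.append(freq)
    return history

if __name__ == "__main__":
    # Small populations drift quickly; the exact path depends on the seed.
    for generation, freq in enumerate(simulate_drift(20, 10, seed=1)):
        print(generation, f"{freq:.2f}")
```

Running it with a small `pop_size` shows the frequency wandering away from 0.5 even though neither allele is favored; once an allele reaches frequency 0 or 1 it stays there.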
Where do substitutions occur? • Non-coding regions exhibit a substitution rate 2X greater than coding regions • Coding regions are more “functionally constrained” • Higher degeneracy of codon, higher substitution rate observed • A thought: Coding sequences – sequence constraint; Non-coding sequence – structure constraint???
Natural variants • Site-directed mutagenesis studies of a single gene will give way to comparative genomic studies derived from the abundance of sequence data • As a result, it is important to understand molecular evolution and models describing this process
The relationship between time and substitutions is non-linear
Observing differences in nucleotides • The simplest measure of distance between two sequences is to count the sites where the two sequences differ; expressed as a proportion of the sites compared, this is called the p-distance • If not all sites are equally likely to change, the same site may undergo repeated substitutions • As time goes by, the number of observed differences between two sequences becomes a less and less accurate estimator of the actual number of substitutions that have occurred
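The p-distance can be computed directly from two aligned sequences. The sketch below assumes the sequences are already aligned, have equal length, and contain no gaps; the function name `p_distance` and the toy sequences are illustrative only.

```python
def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # Count the sites where the two sequences carry different characters.
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)

if __name__ == "__main__":
    # Toy sequences differing at 2 of 8 sites.
    print(p_distance("ACGTACGT", "ACGTTCGA"))
```

Because a site that changes twice still counts as at most one difference, this value underestimates the true number of substitutions for diverged sequences.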
So what is phylogenetics good for? Phylogenetics has direct applications to: • Conservation: test wood, ivory, meat products for poaching • Agriculture: analyze specific differences between cultivars • Forensics: DNA fingerprinting • Medicine: determine specific biochemical function of cancer-causing genes
Phylogenetic concepts: Interpreting a Phylogeny • Which sequence is most closely related to B? A, because B diverged from A more recently than from any other sequence. • Physical position in tree is not meaningful! Only tree structure matters. [Figure: tree of Sequences A, B, C, D and E drawn along a time axis]
Rooted vs. unrooted • Root – ancestor of all taxa considered • Unrooted – relationship without consideration of ancestry • Often specify root with outgroup • Outgroup – distantly related species (e.g., an archaeal species as the outgroup for mammals)
Phylogenetic concepts: Rooted and Unrooted Trees [Figure: rooted and unrooted trees relating taxa A, B, C and D, with the root marked and a time axis]
Tree Types • Evolutionary trees measure time. • Phylograms measure change. [Figure: the same taxa (sharks, seahorses, frogs, owls, crocodiles, armadillos, bats) drawn as a rooted evolutionary tree with a 50 million years scale bar and as a rooted phylogram with a 5% change scale bar]
Tree Properties • Ultrametricity: all tips are an equal distance from the root (in the figure, a = b + c + d + e) • Additivity: the distance between any two tips equals the total branch length between them (in the figure, XY = a + b + c + d + e) • In simple scenarios, evolutionary trees are ultrametric and phylograms are additive. [Figure: two rooted trees with tips X and Y and branches labeled a through e]
Tree building • Get protein/RNA/DNA sequences • Construct multiple sequence alignment • Compute pairwise distances (if necessary) • Build tree – topology and distances • Estimate reliability • Visualize
Various models have been generated to more accurately estimate distance and evolution • All use the following framework: • Probability matrix: pAC is the probability that a site starting with an A has a C at the end of time interval t, and so on for every other pair of bases • Base composition of the sequence: fA = frequency of A, and so on for the other bases
Phylogenetic Methods • Many different procedures exist. Three of the most popular: • Neighbor-joining: minimizes distance between nearest neighbors • Maximum parsimony: minimizes total evolutionary change • Maximum likelihood: maximizes likelihood of observed data
Which procedure should we use? Neighbor-joining? Maximum parsimony? Maximum likelihood? All that we can! • Each method has its own strengths • Use multiple methods for cross-validation • In some cases, none of the three gives the correct phylogeny!
Jukes-Cantor Model • Distance between any two sequences is given by: d = -(3/4) ln(1 - (4/3)p) • p is the proportion of nucleotides that differ between the two sequences • All substitutions are equally probable • Each off-diagonal entry of the substitution matrix = α; each diagonal entry = 1 - 3α (one minus the sum of the other entries in its row)
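The Jukes-Cantor correction is a one-line calculation once p is known. The sketch below is illustrative; the function name `jukes_cantor_distance` is an assumption, and the input check reflects the fact that the logarithm is undefined once p reaches 3/4.

```python
import math

def jukes_cantor_distance(p):
    """Jukes-Cantor distance d = -(3/4) ln(1 - (4/3) p).

    p is the proportion of differing sites (the p-distance).
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)

if __name__ == "__main__":
    # The corrected distance grows faster than p as sequences diverge.
    for p in (0.05, 0.10, 0.30, 0.60):
        print(p, round(jukes_cantor_distance(p), 4))
```

For small p the corrected distance is close to p itself; as p approaches 0.75 the estimate grows without bound, which is one way to see that the relationship between observed differences and substitutions is non-linear.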
Kimura’s two parameter model • d = ½ ln[1/(1-2P-Q)] + ¼ ln[1/(1-2Q)] • P and Q are the proportions of sites that differ between the two sequences due to transitions and transversions, respectively. • Accounts for transition bias in sequences (transversions are rarer)
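A minimal sketch of the two-parameter calculation follows. It assumes gap-free aligned DNA sequences over the alphabet A, C, G, T; the function names are illustrative. A transition is a purine-to-purine (A, G) or pyrimidine-to-pyrimidine (C, T) change, and any other difference is a transversion.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def transition_transversion_proportions(seq_a, seq_b):
    """Return (P, Q): proportions of sites with transitions and transversions."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    transitions = transversions = 0
    for a, b in zip(seq_a, seq_b):
        if a == b:
            continue
        same_class = ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)
        if same_class:
            transitions += 1
        else:
            transversions += 1
    n = len(seq_a)
    return transitions / n, transversions / n

def kimura_2p_distance(P, Q):
    """d = 1/2 ln[1/(1 - 2P - Q)] + 1/4 ln[1/(1 - 2Q)]."""
    a = 1.0 - 2.0 * P - Q
    b = 1.0 - 2.0 * Q
    if a <= 0 or b <= 0:
        raise ValueError("P and Q are too large for the correction to apply")
    return 0.5 * math.log(1.0 / a) + 0.25 * math.log(1.0 / b)
```

As a consistency check, when transitions make up one third of the differences (P = p/3, Q = 2p/3) both logarithms have the argument 1 - (4/3)p and the formula reduces to the Jukes-Cantor distance.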
Distances in Amino acid sequences • Account for synonymous and non-synonymous changes in respective codons • Pathways to double mutations
Dealing with multiple substitutions • Unweighted method – pathways are equally likely • Weighted – favor synonymous changes • Degeneracy classifications: • Nondegenerate (0) – First two positions of TTT (Phe) • Two-fold degenerate (2) – Third position of TTT (Phe) • Four-fold degenerate (4) – Third position of GTT (Val)
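The degeneracy classes can be derived mechanically from the standard genetic code by asking how many of the four bases at a codon position leave the amino acid unchanged. The sketch below is an illustration under stated assumptions: it uses the standard code only, the helper names are hypothetical, and the three-fold case (the third position of the isoleucine codons) is lumped with the two-fold class as a simplification.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def synonymous_bases(codon, position):
    """Number of bases at `position` (0, 1 or 2) that keep the amino acid."""
    amino_acid = CODON_TABLE[codon]
    return sum(
        1
        for base in BASES
        if CODON_TABLE[codon[:position] + base + codon[position + 1:]] == amino_acid
    )

def degeneracy_class(codon, position):
    """Return 0 (nondegenerate), 2 (two-fold) or 4 (four-fold)."""
    count = synonymous_bases(codon, position)
    if count == 1:
        return 0
    if count == 4:
        return 4
    return 2  # assumption: three-fold sites are treated as two-fold

if __name__ == "__main__":
    for codon in ("TTT", "GTT"):
        print(codon, [degeneracy_class(codon, i) for i in range(3)])
```

For the two codons named on the slide this gives classes 0, 0, 2 for TTT (Phe) and 0, 0, 4 for GTT (Val).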
Trees are hypotheses about evolutionary history So far, we’ve looked at understanding and formulating these hypotheses. Now, let’s turn our attention to testing them.
Testing the reliability of trees • Interior branch test or bootstrap analysis • Bootstrap analysis: resample the sequence data (subsequences, or sequence deletion or replacement), re-draw the trees, and count how many times the same branching appears • Bootstrap values of 70 (or, more stringently, 95) or greater are normally considered reliable
Tree Testing: Split Decomposition • Split decomposition is one method for testing a tree. Under this procedure, we choose exactly four taxa (A, B, C, D) and examine the topologies of all possible unrooted trees. • How many such trees are there? • Only one of these topologies is right. How can we quantitatively assess the support for each tree? [Figure: the possible unrooted topologies for taxa A, B, C and D]
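For four taxa there are three possible unrooted bifurcating topologies: AB|CD, AC|BD and AD|BC. More generally, the number of unrooted bifurcating trees on n taxa is (2n - 5)!!, the product of the odd numbers from 3 up to 2n - 5. The short sketch below computes that count and lists the four-taxon splits; the function names are illustrative.

```python
def count_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees on n_taxa labeled tips: (2n - 5)!!"""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product for n = 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count

def four_taxon_splits(a, b, c, d):
    """The three ways to pair four taxa across the internal branch."""
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

if __name__ == "__main__":
    for n in (3, 4, 5, 10):
        print(n, count_unrooted_trees(n))
    print(four_taxon_splits("A", "B", "C", "D"))
```

The count grows very quickly with the number of taxa, which is why restricting attention to quartets keeps the comparison tractable.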
Tree Testing: Split Decomposition • The correct tree should be approximately additive; the others usually will not. • For each tree, we calculate split indices that estimate the length of the internal branch. [Figure: split-index formula, a signed combination of pairwise distances divided by 2, shown for the case where AB|CD is the right phylogeny] • Large split indices: long internal branch; topology strongly supported • Small split indices: short internal branch; topology weakly supported • Negative split indices: biologically impossible; topology probably wrong
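The exact formula from the slide's figure is not recoverable here, so the sketch below uses one common additive-tree estimate as a stand-in, and that choice is an assumption. For the topology AB|CD it averages the two "across" distance sums and subtracts the "within" sum: index = [(dAC + dBD) + (dAD + dBC) - 2(dAB + dCD)] / 4. On perfectly additive distances this equals the internal branch length for the true topology and is negative for the other two. All names are illustrative.

```python
def internal_branch_estimate(dist, pair1, pair2):
    """Estimated internal branch length for the split pair1 | pair2.

    dist(x, y) must return the pairwise distance between taxa x and y.
    """
    (a, b), (c, d) = pair1, pair2
    within = dist(a, b) + dist(c, d)
    across = dist(a, c) + dist(b, d) + dist(a, d) + dist(b, c)
    return (across - 2.0 * within) / 4.0

def score_topologies(taxa, dist):
    """Score all three unrooted topologies for four taxa."""
    a, b, c, d = taxa
    splits = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    return {
        f"{p[0]}{p[1]}|{q[0]}{q[1]}": internal_branch_estimate(dist, p, q)
        for p, q in splits
    }

if __name__ == "__main__":
    # Additive toy distances from a tree AB|CD with pendant branches
    # A=1, B=2, C=3, D=4 and an internal branch of length 5.
    table = {
        frozenset("AB"): 3, frozenset("CD"): 7,
        frozenset("AC"): 9, frozenset("AD"): 10,
        frozenset("BC"): 10, frozenset("BD"): 11,
    }
    lookup = lambda x, y: table[frozenset((x, y))]
    print(score_topologies("ABCD", lookup))
```

With the toy distances, the true topology AB|CD recovers the internal branch length of 5, while the two alternatives receive negative scores, matching the interpretation given on the slide.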
Tree Testing: Bootstrapping • Used to assess the support for individual branches • Randomly resample characters, with replacement • Repeat many times (1000 or more) • How often does a specific branch appear? [Figure: tree of rat, human, turtle, fruit fly, oak and duckweed with bootstrap values 100, 73 and 98 on its branches]
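The resampling step can be sketched as follows. Building a full tree for every replicate is beyond a short example, so this sketch substitutes a much simpler question as an assumption: in what percentage of replicates is a given pair of sequences the closest pair? The function names and the toy alignment are illustrative.

```python
import random

def bootstrap_columns(alignment, rng):
    """Resample alignment columns (characters) with replacement."""
    n_sites = len(alignment[0])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[i] for i in picks) for seq in alignment]

def closest_pair(alignment):
    """Indices (i, j) of the two sequences with the fewest differences."""
    best = None
    for i in range(len(alignment)):
        for j in range(i + 1, len(alignment)):
            diff = sum(x != y for x, y in zip(alignment[i], alignment[j]))
            if best is None or diff < best[0]:
                best = (diff, (i, j))
    return best[1]

def bootstrap_support(alignment, pair, replicates=1000, seed=0):
    """Percentage of replicates in which `pair` is the closest pair."""
    rng = random.Random(seed)
    hits = sum(
        closest_pair(bootstrap_columns(alignment, rng)) == pair
        for _ in range(replicates)
    )
    return 100.0 * hits / replicates

if __name__ == "__main__":
    toy = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACCTAT", "TCGAACCTGT"]
    print(bootstrap_support(toy, (0, 1), replicates=200, seed=1))
```

In a real analysis each replicate alignment would be passed to the tree-building method, and the support for a branch would be the percentage of replicate trees containing it.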
Rates of nucleotide substitutions between human and mouse or rat • Synonymous rate = 2-10 substitutions per site per 10^9 years in coding regions • Nonsynonymous rate = 0-3 substitutions per site per 10^9 years in coding regions (more variable among genes) • Synonymous rate exceeds nonsynonymous rate
Molecular Clocks • Do homologous proteins evolve at the same substitution rate? • Estimate relative rates using an outgroup • But, what about effects of generation time, metabolic specialization, etc?
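One simple way to estimate relative rates with an outgroup is the three-point calculation sketched below. It is an illustration, not a method taken from the slides: given pairwise distances among two ingroup sequences A and B and an outgroup O, and assuming the distances are additive, the branch from the A-B common ancestor to A is (dAB + dAO - dBO) / 2, and similarly for B. Under a perfect clock the two branches would be equal. The function name is hypothetical.

```python
def ingroup_branch_lengths(d_ab, d_ao, d_bo):
    """Branch lengths from the A-B common ancestor to A and to B.

    d_ab, d_ao, d_bo are pairwise distances among ingroup taxa A and B
    and outgroup O, assumed to be additive.
    """
    branch_a = (d_ab + d_ao - d_bo) / 2.0
    branch_b = (d_ab + d_bo - d_ao) / 2.0
    return branch_a, branch_b

if __name__ == "__main__":
    # Toy distances: A is farther from the outgroup than B is,
    # so the lineage leading to A has accumulated more change.
    branch_a, branch_b = ingroup_branch_lengths(0.3, 0.6, 0.5)
    print(branch_a, branch_b, branch_a - branch_b)
```

The difference between the two branches equals dAO - dBO, so a value far from zero suggests that the two lineages have not evolved at the same rate since they diverged.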
Darwin’s theory reinterpreted homology as common ancestry. [Figure: an ancestral sequence, ATCGGCCACTTTCGCGATCA, diverges through a series of intermediate sequences, each differing by single substitutions, into two homologous sequences, ACCGGCCACCTTCGCGATCG and ATAGGGCAGTCTCGCGATTA]
Orthologs arise by speciation • Orthologs are “evolutionary counterparts” – Koonin (2001) [Figure: a speciation event splits the sequence in the ancestral organism, ATCGGCCACTTTCGCGATCA, into two orthologous sequences, ATAGGGCAGTCTCGCGATTA and ACCGGCCACCTTCGCGATCG, one in each of modern species A and modern species B]
Paralogs arise by duplications [Figure: a duplication event splits the sequence in the ancestral organism, ATCGGCCACTTTCGCGATCA, into two paralogous sequences, ATAGGGCAGTCTCGCGATTA and ACCGGCCACCTTCGCGATCG, one in each of modern duplicate A and modern duplicate B]
We have different types of hemoglobins • The major adult hemoglobin is composed of 2 α chains and 2 β chains. • The major fetal hemoglobin is composed of 2 α chains and 2 γ chains. Hardison PNAS 2001 98: 1327-1329
“There may thus exist a Molecular Evolutionary Clock” – Zuckerkandl & Pauling (1965) • Note: This model explains why the distance between Human α and Cow α is shorter than the distance between Human α and Human β. • A model of sequence divergence can be used to estimate the duplication dates of the different hemoglobin chains [Figure: a primordial hemoglobin undergoes a duplication event giving the α and β lineages; a later speciation event in each lineage gives Human α, Cow α, Human β and Cow β]
Different clocks keep different times [Figure: comparison between horse and man] PBS Evolution Library (http://www.pbs.org/wgbh/evolution/library/)
The clock varies for different regions of the protein For example, locations on the exterior of the protein may change at a different rate than those on the interior.
No universal clocks found! Two terrible clocks Ayala, F. Bioessays 1999 Jan;21(1):71-5
The common estimate is 1,100 My Ayala, F. Bioessays 1999 Jan;21(1):71-5
What causes deviations from the clock? • Generation time: Shorter generation time will accelerate the clock because it shortens the time to fix new mutations. • Mutation rate: Species-characteristic differences in polymerases or other biological properties that affect the fidelity of DNA replication, and hence the incidence of mutations. • Gene function: Changes in the function of a protein as evolutionary time proceeds. This might particularly be expected in the case of gene duplication. • Natural selection: Organisms are continually adapting to the physical and biotic environments, which change endlessly in patterns that are unpredictable and differently significant to different species. Ayala, F. Bioessays 1999 Jan;21(1):71-5