Biology 4900

Biology 4900 Biocomputing

Chapter 7 Molecular Phylogeny and Evolution

Outline • Introduction to evolution and phylogeny • Nomenclature of relationship trees • Five stages of molecular phylogeny: • Selecting sequences • Multiple sequence alignment • Models of substitution • Tree-building • Tree evaluation

Historical background: Evolution • Evolution: Theory that groups of organisms change over time; descendants are structurally and functionally different from their ancestors. • Darwin (1859) proposed the role of natural selection in hereditary change. • Heredity is generally conservative: changes in offspring tend to be small. • Prior to the 1950s, phylogeny was studied by comparing living species with fossils. • Modern techniques of molecular biology were not yet available. Pevsner, Bioinformatics and Functional Genomics, 2009; http://www.ucsd.tv/evolutionmatters/lesson2/study.shtml

Tree of Life http://www.allvoices.com/contributed-news/4553607-is-chimps-as-smart-as-human; After Pace NR (1997) Science 276:734; http://en.wikipedia.org/wiki/File:E_coli_at_10000x,_original.jpg

3 main mechanisms of evolutionary change • Conditions of growth affect development of organism. • Physiological response to external stimuli • One or more differences in an organism that increase its likelihood of surviving in an environment long enough to reproduce • Change resulting from sexual reproduction. • Offspring contains genetic material from 2 parents. • Mutation with selection/genetic drift. • Similar to first mechanism Pevsner, Bioinformatics and Functional Genomics, 2009

Historical background • Studies of molecular evolution began with the first sequencing of proteins, beginning in the 1950s. • In 1953 Frederick Sanger and colleagues determined the primary amino acid sequence of insulin. • (The accession number of human insulin is NP_000198) Pevsner, Bioinformatics and Functional Genomics, 2009; http://en.wikipedia.org/wiki/Frederick_Sanger

Historical background: insulin • By the 1950s, it became clear that amino acid substitutions occur non-randomly. • For example, Sanger and colleagues noted that most amino acid changes in the insulin A chain are restricted to a disulfide loop region. • Such differences are called “neutral” changes (Kimura, 1968; Jukes and Cantor, 1969). • Subsequent studies at the DNA level showed that rate of nucleotide (and of amino acid) substitution is about six- to ten-fold higher in the C peptide, relative to the A and B chains. Pevsner, Bioinformatics and Functional Genomics, 2009, p. 219

Note the sequence divergence in the disulfide loop region of the A chain Fig. 7.3 Page 220

Figure: Number of nucleotide substitutions/site/year along the insulin sequence; values shown are 0.1 × 10^-9, 1 × 10^-9 and 0.1 × 10^-9. Fig. 7.3 Page 220

Historical background: insulin • Surprisingly, insulin from the guinea pig (and from the related coypu) evolves seven times faster than insulin from other species. Why? • The answer: guinea pig and coypu insulin do not bind two zinc ions, while insulin molecules from most other species do. • The structural constraints on these molecules were relaxed, and so the genes diverged rapidly. Figure: Sus scrofa insulin (1ZNI.pdb) Page 219

Guinea pig and coypu insulin have undergone an extremely rapid rate of evolutionary change Arrows indicate positions at which guinea pig insulin (A and B chains) differs from human and mouse Pevsner, Bioinformatics and Functional Genomics, 2009, p. 220

Historical Background: Vasopressin and Oxytocin • Vasopressin and oxytocin were also sequenced in the 1950s. • The sequences differ by only 2 residues, but the functions differ significantly. • Vasopressin • Reduces the volume of urine formed in the body to retain water • Oxytocin • Causes contractions in smooth muscles of uterus Seager SL, Slabaugh MR, Chemistry for Today: General, Organic and Biochemistry, 7th Edition, 2011

Molecular clock hypothesis • In the 1960s, sequence data accumulated for small, abundant proteins (e.g., globins, cytochromes c, fibrinopeptides). • Some proteins appeared to evolve slowly, while others evolved rapidly. • Linus Pauling, Emanuel Margoliash and others proposed the hypothesis of a molecular clock: • For every given protein, the rate of molecular evolution is approximately constant in all evolutionary lineages. Figure: mutations in protein X accumulating over time in species A, B, C and D. Pevsner, Bioinformatics and Functional Genomics, 2009, p. 221

Molecular clock hypothesis • Richard Dickerson (1971) plotted sequence data from three protein families: cytochrome c, hemoglobin, and fibrinopeptides. • X-axis shows the divergence times of the species (millions of years since divergence), estimated from paleontological data. • Y-axis shows m, the corrected number of amino acid changes per 100 residues. • n is the observed number of amino acid changes per 100 residues; it is corrected to m to account for changes that occur but are not observed (for example, multiple changes at the same site). Dickerson, 1972
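The slide does not show how n is corrected to m. A minimal sketch, assuming the Poisson-type correction m/100 = -ln(1 - n/100) that is commonly attributed to this kind of analysis; the function name `corrected_changes` is made up for illustration.

```python
import math


def corrected_changes(n_observed: float) -> float:
    """Convert observed changes per 100 residues (n) to corrected changes (m).

    Assumption (not stated on the slide): m/100 = -ln(1 - n/100).
    """
    if not 0 <= n_observed < 100:
        raise ValueError("n must satisfy 0 <= n < 100")
    return -100.0 * math.log(1.0 - n_observed / 100.0)


# The correction is small for similar sequences and grows quickly
# as the observed divergence approaches saturation.
for n in (5, 20, 50, 80):
    print(f"n = {n:2d}  ->  m = {corrected_changes(n):6.1f}")
```

Under this assumption m is always at least n, which is why the corrected values, rather than the raw counts, are the ones expected to fall on a straight line against divergence time.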

Molecular clock hypothesis

Molecular clock hypothesis: conclusions • Dickerson drew the following conclusions: • For each protein, the data lie on a straight line. Thus, the rate of amino acid substitution has remained constant for each protein. • The average rate of change differs for each protein. The time for a 1% change to occur between two lines of evolution is 20 MY (cytochrome c), 5.8 MY (hemoglobin), and 1.1 MY (fibrinopeptides). • The observed variations in rate of change reflect functional constraints imposed by natural selection. Pevsner, Bioinformatics and Functional Genomics, 2009, p. 223

Molecular clock for proteins: rate of substitutions per aa site per 10^9 years (Table 7-1, Page 223)
Protein            Rate
Fibrinopeptides    9.0
Kappa casein       3.3
Lactalbumin        2.7
Serum albumin      1.9
Lysozyme           0.98
Trypsin            0.59
Insulin            0.44
Cytochrome c       0.22
Histone H2B        0.09
Ubiquitin          0.010
Histone H4         0.010

Molecular clock hypothesis: problems • Rate of molecular evolution varies among different organisms (e.g., viral sequences change very rapidly). • Clock varies among different genes and parts of genes, guided in part by selection (e.g., animals with shorter generation times may have faster clocks). • Clock is only applicable when a gene retains its function over evolutionary time. • Genes may become nonfunctional. • Rate of evolution may accelerate following gene duplication. Pevsner, Bioinformatics and Functional Genomics, 2009, p. 225

Rate of nucleotide substitution r and time of divergence T • Rate of nucleotide substitution (r) = number of nucleotide substitutions that occur per site per year (same for AAs) • Time of divergence (T) = how long ago the two sequences split from a common sequence. • K = number of substitutions per site between the two sequences (here, nonsynonymous substitutions). • The rate follows from r = K / (2T); the factor of 2 reflects that both lineages have accumulated substitutions since the split. • α-globins from rat and human differ by 0.093 nonsynonymous substitutions per site. • Rate of change = 0.58 × 10^-9 nonsynonymous nucleotide substitutions per site per year Pevsner, Bioinformatics and Functional Genomics, 2009, p. 225
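The α-globin figures can be checked with a short calculation. This is a sketch that assumes the standard relation r = K / (2T) and a rat-human divergence time of about 80 million years; that value of T is not given on the slide and is back-calculated here from the stated K and r.

```python
def substitution_rate(k: float, t_years: float) -> float:
    """Substitutions per site per year, assuming r = K / (2T).

    The factor of 2 accounts for substitutions accumulating
    independently along both lineages since the split.
    """
    if t_years <= 0:
        raise ValueError("divergence time must be positive")
    return k / (2.0 * t_years)


K = 0.093  # nonsynonymous substitutions per site (from the slide)
T = 80e6   # years since divergence (assumption, not from the slide)

r = substitution_rate(K, T)
print(f"r = {r:.2e} substitutions per site per year")
```

Rearranged as T = K / (2r), the same relation is what lets a calibrated clock date a divergence from sequence data alone.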

Neutral theory of molecular evolution • There is a significant amount of DNA polymorphism across all species that is difficult to account for by conventional natural selection. • Kimura (1968, 1983) proposed an alternative model to Darwinian evolution at the DNA level. • He observed that the rate of AA substitution averages ~1 change per 28 × 10^6 years for proteins of 100 residues. • The corresponding rate of nucleotide substitution is 1 base pair in the genome of a population every 2 years. • Kimura concluded that most DNA substitutions must be neutral and that the main cause of evolutionary change at the molecular level is random drift of mutant alleles. Pevsner, Bioinformatics and Functional Genomics, 2009, p. 225

Molecular phylogeny: nomenclature of trees • Molecular Phylogeny: Study of evolutionary relationships among organisms or molecules • Determined via molecular biology • There are two main kinds of information inherent to any tree: topology and branch lengths. • We will now describe the parts of a tree. Page 231

Molecular phylogeny uses trees to depict evolutionary relationships among organisms. These trees are based upon DNA and protein sequence data. Figure: the same tree of OTUs A-E (internal nodes F-I) drawn as a cladogram (change in time) and as a phylogram (change in magnitude; scale bar = one unit), with branch lengths labeled. Pevsner, Bioinformatics and Functional Genomics, 2009, p. 232

Tree Nomenclature • Node: intersection or terminating point of two or more branches • Branch (edge): connects 2 nodes Figure: tree of OTUs A-E with internal nodes F-I, drawn along a time axis. Fig. 7.8 Page 232

Tree Nomenclature: Nodes • Operational taxonomic units (OTUs): taxa that provide observable features (e.g., existing protein sequences, morphological features) • Internal node (taxon) • Root node Figure: tree of OTUs A-E with internal nodes F-I, drawn along a time axis. Pevsner, Bioinformatics and Functional Genomics, 2009, p. 232

Tree Nomenclature: Cladogram vs. Phylogram • Cladogram: branches are unscaled; OTUs are neatly aligned, and nodes reflect time. • Phylogram: branches are scaled; branch lengths are proportional to the number of amino acid changes (scale bar = one unit). Figure: the same tree of OTUs A-E drawn both ways. Pevsner, Bioinformatics and Functional Genomics, 2009, p. 232

Tree nomenclature: Internal Nodes • Bifurcating internal node • Multifurcating internal node Figure: cladogram and phylogram of OTUs A-E with one example of each kind of internal node marked. Pevsner, Bioinformatics and Functional Genomics, 2009, p. 232

Tree nomenclature: clades • Clade: group of taxa derived from a common ancestor • Clade ABF (monophyletic group) • Clade CDH (monophyletic group) • Clade ABF/CDH/G (paraphyletic) Figure: two copies of the tree of OTUs A-E (internal nodes F-I) with these groups highlighted. Pevsner, Bioinformatics and Functional Genomics, 2009, p. 232

Tree nomenclature: Newick Format
(,,(,)); no nodes are named
(A,B,(C,D)); leaf nodes are named
(A,B,(C,D)E)F; all nodes are named
(:0.1,:0.2,(:0.3,:0.4):0.5); all but root node have a distance to parent
(:0.1,:0.2,(:0.3,:0.4):0.5):0.0; all have a distance to parent
(A:0.1,B:0.2,(C:0.3,D:0.4):0.5); distances and leaf names (popular)
(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F; distances and all names
((B:0.2,(C:0.3,D:0.4)E:0.5)F:0.1)A; a tree rooted on a leaf node (rare)
http://en.wikipedia.org/wiki/Newick_format
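To make the notation concrete, here is a minimal sketch of a recursive-descent reader for the simple Newick variants listed on this slide. It assumes unquoted names and no bracketed comments, so it is not a full parser for the format; `parse_newick` and `leaf_names` are illustrative names, not part of any library.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts.

    Each node is {"name": str, "length": float or None, "children": list}.
    Quoted labels and [comments] are not handled (simplifying assumption).
    """
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick string must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":               # internal node: read its children
            pos += 1
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1
        start = pos                        # optional node name
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        length = None
        if text[pos] == ":":               # optional distance to parent
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaf_names(node):
    """Collect the names of the leaves (OTUs) below a node, left to right."""
    if not node["children"]:
        return [node["name"]]
    names = []
    for child in node["children"]:
        names.extend(leaf_names(child))
    return names


tree = parse_newick("(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;")
print(tree["name"], leaf_names(tree))
```

Parentheses group the children of an internal node, the text after a closing parenthesis names that node, and the number after a colon is the branch length to the parent.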

Examples of clades Lindblad-Toh et al., Nature 438: 803 (2005), fig. 10

Tree Roots • The root of a phylogenetic tree represents the common ancestor of the sequences. • Some trees are unrooted, and thus do not specify the common ancestor. Figure: an unrooted tree and the corresponding rooted tree, which specifies the evolutionary path from past to present. Pevsner, Bioinformatics and Functional Genomics, 2009, p. 233

Tree Roots • A tree can be rooted using an outgroup (that is, a taxon known to be distantly related to all other OTUs). Figure: a tree in which the outgroup is used to place the root, shown with the resulting rooted tree running from past to present. Pevsner, Bioinformatics and Functional Genomics, 2009, p. 233

Enumerating trees Cavalli-Sforza and Edwards (1967) derived the number of possible bifurcating unrooted trees ($N_U$) for n OTUs (n ≥ 3):
$$N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$$
The number of bifurcating rooted trees ($N_R$) is:
$$N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$$
For 10 OTUs (e.g. 10 DNA or protein sequences), the number of possible rooted trees is ≈ 34 million, and the number of unrooted trees is ≈ 2 million. Many tree-making algorithms can exhaustively examine every possible tree for up to ten to twelve sequences. Page 235

Numbers of possible trees extremely large for >10 sequences
Number of OTUs    Number of rooted trees    Number of unrooted trees
2                 1                         1
3                 3                         1
4                 15                        3
5                 105                       15
10                34,459,425                2,027,025
20                8 × 10^21                 2 × 10^20
Pevsner, Bioinformatics and Functional Genomics, 2009, p. 235
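The counts can be reproduced directly from the two factorial formulas, N_U = (2n-5)! / (2^(n-3) (n-3)!) and N_R = (2n-3)! / (2^(n-2) (n-2)!). A short sketch using exact integer arithmetic; the function names are illustrative.

```python
from math import factorial


def unrooted_trees(n: int) -> int:
    """Number of bifurcating unrooted trees for n OTUs (n >= 3)."""
    if n < 3:
        raise ValueError("the unrooted formula requires n >= 3")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


def rooted_trees(n: int) -> int:
    """Number of bifurcating rooted trees for n OTUs (n >= 2)."""
    if n < 2:
        raise ValueError("the rooted formula requires n >= 2")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


for n in (3, 4, 5, 10, 20):
    print(f"{n:2d} OTUs: {rooted_trees(n):.3e} rooted, "
          f"{unrooted_trees(n):.3e} unrooted")
```

The number of rooted trees for n OTUs equals the number of unrooted trees for n + 1 OTUs, because a root can be placed on any branch of an unrooted tree.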

Five stages of phylogenetic analysis [1] Selection of sequences for analysis [2] Multiple sequence alignment [3] Selection of a substitution model [4] Tree building [5] Tree evaluation Pevsner, Bioinformatics and Functional Genomics, 2009, p. 243

Stage 1: Use of DNA, RNA, or protein DNA • DNA lets you study synonymous and nonsynonymous changes • Synonymous changes do not have corresponding protein changes. • synonymous substitution rate (dS) • nonsynonymous substitution rate (dN) • If dS > dN, DNA sequence is under negative (purifying) selection. • This limits change in the sequence (e.g. insulin A chain). • If dS < dN, positive selection occurs. • Ex. Duplicated gene may evolve rapidly to assume new functions. Pevsner, Bioinformatics and Functional Genomics, 2009, p. 240
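The dS versus dN comparison can be written as a small decision rule. This sketch only compares two already-estimated rates; estimating dN and dS from an alignment is a separate step not covered here, and the example values are hypothetical.

```python
def classify_selection(dn: float, ds: float) -> str:
    """Interpret nonsynonymous (dN) and synonymous (dS) substitution rates."""
    if dn < 0 or ds < 0:
        raise ValueError("substitution rates cannot be negative")
    if ds > dn:
        return "negative (purifying) selection"
    if ds < dn:
        return "positive selection"
    return "no evidence of selection (dN = dS)"


# Hypothetical rates, chosen only to exercise each branch.
print(classify_selection(dn=0.02, ds=0.30))
print(classify_selection(dn=0.45, ds=0.15))
```

In practice the comparison is often reported as the ratio dN/dS, with values below 1 read as purifying selection and values above 1 as positive selection.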

Stage 1: Use of DNA, RNA, or protein DNA • Some substitutions in a DNA sequence alignment can be directly observed: Pevsner, Bioinformatics and Functional Genomics, 2009, p. 241, Figure 7.16

Stage 1: Use of DNA, RNA, or protein DNA • Noncoding regions (such as 5′ and 3′ untranslated regions of genes, or introns) may be analyzed using molecular phylogeny. • Pseudogenes (nonfunctional genes) are studied by molecular phylogeny • Rates of transitions and transversions can be measured. • Transitions: purine (A → G) or pyrimidine (C → T) substitutions • Transversion: purine ↔ pyrimidine Pevsner, Bioinformatics and Functional Genomics, 2009, p. 242
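Counting transitions and transversions between two aligned DNA sequences takes only a few lines. A minimal sketch: columns containing gaps or ambiguity codes are skipped, which is a simplifying assumption, and the two sequences are made up.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_ts_tv(seq1: str, seq2: str):
    """Return (transitions, transversions) for two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue  # identical site, gap, or ambiguous base
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1    # purine <-> pyrimidine
    return transitions, transversions


# Hypothetical aligned sequences.
ts, tv = count_ts_tv("ACGTACGT", "GCATTCGA")
print(f"transitions = {ts}, transversions = {tv}")
```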

Stage 1: Use of DNA, RNA, or protein Protein • Proteins frequently more informative than DNA in pairwise alignment and in BLAST searching. • Proteins have 20 states (amino acids) instead of only 4 for DNA, so there is a stronger phylogenetic signal. • Nucleotides are unordered characters: any one nucleotide can change to any other in one step. • An ordered character must pass through one or more intermediate states before reaching the final state. • Amino acid sequences are partially ordered character states: there is a variable number of states between the starting value and the final value. Pevsner, Bioinformatics and Functional Genomics, 2009, p. 243

Five stages of phylogenetic analysis [1] Selection of sequences for analysis [2] Multiple sequence alignment [3] Selection of a substitution model [4] Tree building [5] Tree evaluation

Stage 2: Multiple sequence alignment The fundamental basis of a phylogenetic tree is the MSA. • Confirm that all sequences are homologous. • Trees can still be generated even with misalignment, or if a non-homologous sequence is included in the alignment. • Adjust gap creation and extension penalties as needed to optimize the alignment. • Restrict phylogenetic analysis to regions of the multiple sequence alignment for which data are available for all taxa (delete columns having incomplete data or gaps, similar to masking). Example:
-----GADEG-YFGPVILAADGEVA---
dlgnvGA-EGDYFGPAI--AEGEVArpl
Pevsner, Bioinformatics and Functional Genomics, 2009
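Restricting the analysis to columns with data for all taxa can be sketched as follows, using the two aligned sequences from this slide. The function name `strip_gap_columns` is illustrative, and treating only "-" as a gap character is an assumption.

```python
def strip_gap_columns(alignment, gap_chars="-"):
    """Drop every alignment column that contains a gap in any sequence."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all sequences must have the same aligned length")
    kept = [col for col in zip(*alignment)
            if not any(ch in gap_chars for ch in col)]
    if not kept:
        return ["" for _ in alignment]
    # Transpose the surviving columns back into rows.
    return ["".join(chars) for chars in zip(*kept)]


msa = [
    "-----GADEG-YFGPVILAADGEVA---",
    "dlgnvGA-EGDYFGPAI--AEGEVArpl",
]
for row in strip_gap_columns(msa):
    print(row)
```

Only the 16 columns present in both sequences survive; the unaligned ends and the internal indels are removed before any distances are computed.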

Five stages of phylogenetic analysis [1] Selection of sequences for analysis [2] Multiple sequence alignment [3] Selection of a substitution model [4] Tree building [5] Tree evaluation

Stage 4: Tree-building methods: distance • Simplest approach to measuring distances between sequences is to align pairs of sequences, and then to count the number of differences. • Degree of divergence is called the Hamming distance. • For alignment of length N with n sites at which there are differences, the degree of divergence D is: D = n / N • Observed differences do not equal genetic distance! • Genetic distance involves mutations that are not observed directly. Pevsner, Bioinformatics and Functional Genomics, 2009
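The divergence D = n / N can be computed directly from an aligned pair. A minimal sketch with made-up sequences; it assumes the pair is already aligned and counts every mismatching column as a difference.

```python
def hamming_distance(seq1: str, seq2: str) -> int:
    """Number of aligned sites (n) at which the two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq1, seq2))


def divergence(seq1: str, seq2: str) -> float:
    """Degree of divergence D = n / N for an alignment of length N."""
    return hamming_distance(seq1, seq2) / len(seq1)


# Hypothetical aligned sequences differing at 2 of 10 sites.
a = "ACGTACGTAC"
b = "ACGTTCGTAA"
print(hamming_distance(a, b), divergence(a, b))
```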

Stage 4: Tree-building methods: distance Jukes and Cantor (1969) proposed a corrective formula: $K = -\frac{3}{4}\ln\left(1 - \frac{4}{3}D\right)$, where D = n / N is the observed proportion of differing sites and K is the corrected number of substitutions per site. • This model describes the probability that one nucleotide will change into another. • Still imperfect, as it assumes that each residue is equally likely to change into any other (i.e. the rate of transversions equals the rate of transitions). • In practice, the transition rate is typically greater than the transversion rate. Pevsner, Bioinformatics and Functional Genomics, 2009

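Stage 4: Tree-building methods: distance Jukes and Cantor (1969) proposed a corrective formula: $K = -\frac{3}{4}\ln\left(1 - \frac{4}{3}D\right)$, where D is the observed proportion of differing sites. Consider an alignment where 3/60 aligned residues differ. The normalized Hamming distance is 3/60 = 0.05. The Jukes-Cantor correction is $K = -\frac{3}{4}\ln\left(1 - \frac{4}{3}(0.05)\right) \approx 0.052$. When 30/60 aligned residues differ, the Jukes-Cantor correction is more substantial: $K = -\frac{3}{4}\ln\left(1 - \frac{4}{3}(0.5)\right) \approx 0.82$. Pevsner, Bioinformatics and Functional Genomics, 2009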
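The two worked cases can be checked numerically. This sketch uses the standard Jukes-Cantor form K = -(3/4) ln(1 - (4/3) p), where p is the observed proportion of differing sites; the function name is illustrative.

```python
import math


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance for an observed proportion p.

    The correction is undefined for p >= 0.75, the divergence expected
    between two unrelated random DNA sequences.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


for differences, length in ((3, 60), (30, 60)):
    p = differences / length
    print(f"p = {p:.2f}  ->  corrected distance = {jukes_cantor(p):.3f}")
```

At low divergence the corrected distance is nearly equal to p, while at 50% observed difference it is about 0.82, reflecting the many substitutions hidden by multiple hits at the same site.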

Models of nucleotide substitution • Transition rate > transversion rate • Transitions: A ↔ G (purines) and T ↔ C (pyrimidines) • Transversions: purine ↔ pyrimidine (A or G ↔ T or C) Pevsner, Bioinformatics and Functional Genomics, 2009

Stage 4: Tree-building methods Distance-based methods • Distance-based methods involve a distance metric, such as the number of amino acid changes between the sequences, or a distance score. • Examples of distance-based algorithms are UPGMA and neighbor-joining. Character-based methods • Include maximum parsimony and maximum likelihood. • Parsimony analysis involves the search for the tree with the fewest amino acid (or nucleotide) changes that account for the observed differences between taxa. Pevsner, Bioinformatics and Functional Genomics, 2009

Tree-building methods: UPGMA UPGMA (unweighted pair group method using arithmetic mean) is based on clustering of sequences. Figure: five proteins, numbered 1-5, to be clustered. Pevsner, Bioinformatics and Functional Genomics, 2009

Tree-building methods: UPGMA Step 1: Compute the pairwise distances of all the proteins. Get ready to put the numbers 1-5 at the bottom of your new tree. Pevsner, Bioinformatics and Functional Genomics, 2009

Tree-building methods: UPGMA Step 2: Find the two proteins with the smallest pairwise distance. Cluster them. Figure: proteins 1 and 2 joined at a new node, 6. Pevsner, Bioinformatics and Functional Genomics, 2009
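The two steps shown here, repeated until one cluster remains, can be sketched as below. This is a simplified illustration rather than a reference implementation: it returns only the tree topology as nested tuples (no branch lengths), and the distance values for proteins 1-5 are hypothetical.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster labels by UPGMA and return the topology as nested tuples.

    dist maps frozenset({a, b}) to the distance between labels a and b.
    """
    clusters = {name: 1 for name in names}  # cluster label -> number of members
    d = dict(dist)
    while len(clusters) > 1:
        # Step 2: find the pair of clusters with the smallest distance.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # Distance from the new cluster to each remaining cluster is the
        # average of the old distances, weighted by cluster size.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


# Step 1: pairwise distances (hypothetical values for proteins 1-5).
pairs = {
    ("1", "2"): 0.10, ("4", "5"): 0.20,
    ("3", "4"): 0.40, ("3", "5"): 0.40,
    ("1", "3"): 0.80, ("1", "4"): 0.80, ("1", "5"): 0.80,
    ("2", "3"): 0.80, ("2", "4"): 0.80, ("2", "5"): 0.80,
}
distances = {frozenset(pair): value for pair, value in pairs.items()}

tree = upgma(["1", "2", "3", "4", "5"], distances)
print(tree)
```

With these values proteins 1 and 2 are joined first, as on the slide, followed by 4 and 5, then 3 with the (4, 5) cluster, and finally the two remaining clusters.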
