Phylogenetic Analysis

Presentation Transcript


  1. Phylogenetic Analysis

  2. Subroutine • Lets code that needs to be reused be written once • A good way to organize code • Called a “function” in some languages • Has a name, a return value, and parameters (passed in @_)

  3. #!/usr/bin/perl -w
     # Count the Gs in a DNA sequence read from a file.
     print "Please type the filename: ";
     $dna_filename = <STDIN>;
     chomp $dna_filename;
     open(DNAFILE, $dna_filename);
     $name = <DNAFILE>;          # first line is the sequence name
     @DNA  = <DNAFILE>;          # remaining lines are the sequence
     close DNAFILE;
     $DNA = join('', @DNA);
     $DNA =~ s/\s//g;            # strip all whitespace, including newlines
     $count_of_G = countG($DNA);
     print $count_of_G;

     sub countG {
         my($dna) = @_;
         my($count) = 0;
         $count = ( $dna =~ tr/Gg// );   # tr/// counts both G and g
         return $count;
     }

  4. #!/usr/bin/perl -w
     # Generalized count: the characters to count are passed as a parameter.
     print "Please type the filename: ";
     $dna_filename = <STDIN>;
     chomp $dna_filename;
     open(DNAFILE, $dna_filename);
     $name = <DNAFILE>;
     @DNA  = <DNAFILE>;
     close DNAFILE;
     $DNA = join('', @DNA);
     $DNA =~ s/\s//g;
     $count_of_G = count($DNA, 'Gg');
     print $count_of_G;

     sub count {
         my($dna, $pattern) = @_;
         my($count) = 0;
         # tr/// cannot interpolate a variable pattern, hence the eval;
         # the backslashes keep $count and $dna themselves out of the
         # string so only $pattern is interpolated into the code.
         eval "\$count = (\$dna =~ tr/$pattern//);";
         return $count;
     }

  5. @fred = (1,2,3);
     @barney = @fred;             # copies the whole list
     @huh = 1;                    # the scalar 1 is promoted to the list (1)
     @fred = qw(one two);
     @barney = (4,5,@fred,6,7);   # lists flatten: (4,5,"one","two",6,7)
     @barney = (8,@barney);
     ($a,$b,$c) = (1,2,3);        # list assignment to scalars
     @fred = (@barney = (2,3,4));
     @fred = @barney = (2,3,4);   # same effect as the line above
     @fred = (1,2,3);
     $fred[3] = "hi";
     $fred[6] = "ho";             # @fred is now (1,2,3,"hi",undef,undef,"ho")

  6. Array operators • push and pop (right-most element) • @mylist = (1,2,3); push(@mylist,4,5,6); • $oldvalue = pop(@mylist); • shift and unshift (left-most element) • @fred = (5,6,7); unshift(@fred,2,3,4); • $x = shift(@fred); • reverse: @a = (7,8,9); @b = reverse(@a); • sort: @a = (7,9,9); @b = sort(@a);
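
A runnable recap of these operators, with the resulting values noted in comments; the variable names follow the slide, and this is only an illustrative sketch:

  #!/usr/bin/perl
  use strict;
  use warnings;

  my @mylist = (1, 2, 3);
  push @mylist, 4, 5, 6;        # @mylist is now (1,2,3,4,5,6)
  my $oldvalue = pop @mylist;   # $oldvalue = 6; @mylist = (1,2,3,4,5)

  my @fred = (5, 6, 7);
  unshift @fred, 2, 3, 4;       # @fred is now (2,3,4,5,6,7)
  my $x = shift @fred;          # $x = 2; @fred = (3,4,5,6,7)

  my @b = reverse @fred;        # (7,6,5,4,3)
  my @c = sort @fred;           # default string sort: (3,4,5,6,7)
  print "@b\n@c\n";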

  7. Print to file • Open a file to print • open FILE, ">filename.txt"; • open (FILE, ">filename.txt"); • Print to the file • print FILE $str;
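
The fragments above omit error handling and closing the file. A minimal complete sketch of the same idiom, using the modern three-argument open with a lexical filehandle (the file name and contents are placeholders):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Open for writing ('>' truncates; '>>' would append), or die with the
  # OS error in $!.
  open(my $fh, '>', 'filename.txt') or die "Cannot open filename.txt: $!";
  print $fh "some output\n";
  close $fh;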

  8. #!/usr/bin/perl
     # Command-line arguments arrive in @ARGV; $0 is the program name.
     print "My name is $0 \n";
     print "First arg is: $ARGV[0] \n";
     print "Second arg is: $ARGV[1] \n";
     print "Third arg is: $ARGV[2] \n";
     $num = $#ARGV + 1;          # $#ARGV is the last index, so add 1
     print "How many args? $num \n";
     print "The full argument string was: @ARGV \n";

  9. Parsimony and Tree Reconstruction (简约, “parsimony”)

  10. Small vs. Large Parsimony • We break the problem into two: • Small parsimony: given the topology, find the best assignment to internal nodes • Large parsimony: find the topology which gives the best score • Large parsimony is NP-hard • We will show solutions to small parsimony (Fitch's and Sankoff's algorithms) • Input to small parsimony: a tree with character-state assignments to its leaves • Example (leaves Aardvark, Bison, Chimp, Dog, Elephant): A: CAGGTA, B: CAGACA, C: CGGGTA, D: TGCACT, E: TGCGTA

  11. Scoring Matrices • The Small Parsimony Problem • The Weighted Parsimony Problem

  12. Unweighted vs. Weighted • Small parsimony scoring matrix: every change between distinct nucleotides costs 1 (0 on the diagonal) • [Figure: example tree scored with this matrix] • Small parsimony score: 5

  13. Unweighted vs. Weighted • Weighted parsimony scoring matrix: each substitution has its own cost • [Figure: the same tree scored with the weighted matrix] • Weighted parsimony score: 22

  14. Sankoff Algorithm (cont.) • Begin at the leaves: • If the leaf has the character in question, the score is 0 • Else, the score is ∞

  15. Sankoff Algorithm (cont.)
      s_t(v) = min_i { s_i(u) + δ(i, t) } + min_j { s_j(w) + δ(j, t) }, where u and w are the children of v
      For character A: s_A(v) = min_i { s_i(u) + δ(i, A) } + min_j { s_j(w) + δ(j, A) }
      The left-hand minimum, over the left subtree, evaluates to 0

  16. Sankoff Algorithm (cont.)
      s_t(v) = min_i { s_i(u) + δ(i, t) } + min_j { s_j(w) + δ(j, t) }
      s_A(v) = min_i { s_i(u) + δ(i, A) } + min_j { s_j(w) + δ(j, A) }
      The right-hand minimum evaluates to 9, so s_A(v) = 0 + 9 = 9

  17. Sankoff Algorithm (cont.)
      s_t(v) = min_i { s_i(u) + δ(i, t) } + min_j { s_j(w) + δ(j, t) }
      Repeat for T, G, and C

  18. Sankoff Algorithm (cont.) Repeat for right subtree

  19. Sankoff Algorithm (cont.) Repeat for root

  20. Sankoff Algorithm (cont.) The smallest score at the root is the minimum weighted parsimony score; in this case it is 9, so the root is labeled T

  21. Sankoff Algorithm: Traveling down the Tree • The scores at the root vertex have been computed by going up the tree • After the scores at the root vertex are computed, the Sankoff algorithm moves down the tree and assigns each vertex its optimal character

  22. Sankoff Algorithm (cont.) The 9 is derived as 7 + 2, so the left child is labeled T and the right child is labeled T

  23. Sankoff Algorithm (cont.) And the tree is thus labeled…
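
The up phase of slides 14-20 is a short dynamic program. Here is a minimal Perl sketch of Sankoff's scoring phase for one site; the hash-based tree, the leaf states, and the cost matrix %delta (transitions cost 1, transversions cost 2) are illustrative assumptions, not the slides' exact example:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use List::Util qw(min);

  my @states = qw(A C G T);
  my $INF    = 1e9;    # stands in for the infinite leaf score

  # delta{i}{t}: cost of changing state i to state t (symmetric here;
  # an illustrative choice, not the slides' matrix).
  my %delta = (
      A => { A => 0, C => 2, G => 1, T => 2 },
      C => { A => 2, C => 0, G => 2, T => 1 },
      G => { A => 1, C => 2, G => 0, T => 2 },
      T => { A => 2, C => 1, G => 2, T => 0 },
  );

  # Returns { state => score } for the subtree rooted at $node.
  sub sankoff_up {
      my ($node) = @_;
      my %s;
      if (exists $node->{state}) {    # leaf: 0 for its state, "infinity" otherwise
          $s{$_} = ($_ eq $node->{state}) ? 0 : $INF for @states;
          return \%s;
      }
      my $l = sankoff_up($node->{left});
      my $r = sankoff_up($node->{right});
      for my $t (@states) {
          # s_t(v) = min_i { s_i(u) + delta(i,t) } + min_j { s_j(w) + delta(j,t) }
          $s{$t} = min(map { $l->{$_} + $delta{$_}{$t} } @states)
                 + min(map { $r->{$_} + $delta{$_}{$t} } @states);
      }
      return \%s;
  }

  # Example tree for one site: ((A,C),(T,G)).
  my $tree = {
      left  => { left => { state => 'A' }, right => { state => 'C' } },
      right => { left => { state => 'T' }, right => { state => 'G' } },
  };
  my $root = sankoff_up($tree);
  my $best = min(values %$root);
  print "minimum weighted parsimony score: $best\n";

The down phase would then pick, at each vertex, the child states that achieved these minima, exactly as in slides 21-23.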

  24. Fitch’s Algorithm (cont’d) An example with leaves labeled a, c, t, a: • Up pass: each leaf gets its own state as a singleton set; an internal node gets the intersection of its children's sets if that intersection is non-empty, otherwise their union. The parent of (a, c) gets {a,c}, the parent of (t, a) gets {t,a}, and the root gets {a,c} ∩ {t,a} = {a} • Down pass: the root picks a, and each internal node keeps its parent's state whenever that state is in its own set, so both internal nodes are labeled a • The two union steps give a parsimony score of 2

  25. Fitch Algorithm (cont.)
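
A minimal Perl sketch of Fitch's up pass for one site, reusing the hash-based tree shape from the Sankoff sketch above and the leaf states a, c, t, a of slide 24; the representation is my own illustrative choice:

  #!/usr/bin/perl
  use strict;
  use warnings;

  my $changes = 0;    # the parsimony score accumulates here

  # Returns the Fitch state set of the subtree as a hashref.
  sub fitch_up {
      my ($node) = @_;
      return { $node->{state} => 1 } if exists $node->{state};   # leaf
      my $l = fitch_up($node->{left});
      my $r = fitch_up($node->{right});
      my %inter;
      $inter{$_} = 1 for grep { $r->{$_} } keys %$l;
      return \%inter if %inter;   # non-empty intersection: no change needed
      $changes++;                 # empty intersection costs one change
      return { %$l, %$r };        # take the union instead
  }

  # Slide 24's example leaves: ((a,c),(t,a)).
  my $tree = {
      left  => { left => { state => 'a' }, right => { state => 'c' } },
      right => { left => { state => 't' }, right => { state => 'a' } },
  };
  my $root_set = fitch_up($tree);
  print "root set: {", join(',', sort keys %$root_set), "}, score: $changes\n";
  # Prints: root set: {a}, score: 2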

  26. Large Parsimony Problem • Input: An n x m matrix M describing n species, each represented by an m-character string • Output: A tree T with n leaves labeled by the n rows of matrix M, and a labeling of the internal vertices such that the parsimony score is minimized over all possible trees and all possible labelings of internal vertices

  27. Large Parsimony Problem (cont.) • The possible search space is huge, especially as n increases • (2n-3)!! possible rooted trees • (2n-5)!! possible unrooted trees • The problem is NP-complete • Exhaustive search is only possible for n < 10 • Hence, branch and bound or heuristics are used

  28. (2n-3)!! • (2n-5)!! = (2n-5) × (2n-7) × … × 3 • For n=10, it is 2,027,025 • For n=13, it is 13,749,310,575 • For n=20, it is 221,643,095,476,699,771,875
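
These counts are easy to verify with the double-factorial formula; a small Perl check (Math::BigInt is used because the n = 20 value no longer fits in a 64-bit integer):

  #!/usr/bin/perl
  use strict;
  use warnings;
  use Math::BigInt;

  # (2n-5)!! = (2n-5) x (2n-7) x ... x 3: unrooted binary trees on n leaves.
  sub unrooted_trees {
      my ($n) = @_;
      my $count = Math::BigInt->new(1);
      for (my $k = 2 * $n - 5; $k >= 3; $k -= 2) {
          $count->bmul($k);
      }
      return $count;
  }

  print unrooted_trees($_), "\n" for 10, 13, 20;
  # 2027025
  # 13749310575
  # 221643095476699771875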

  29. Hill Climbing • Start with an arbitrary tree and check its neighbors (swap branches, etc.) • Move to a neighboring tree if it provides the best improvement in parsimony score • There is no way of knowing whether the result is the most parsimonious tree • The search can get stuck in a local optimum • Methods: NNI, TBR, SPR

  30. Nearest Neighbor Interchange: A Greedy Algorithm • A branch-swapping algorithm • Only evaluates a subset of all possible trees • Defines a neighbor of a tree as one reachable by a nearest neighbor interchange • A rearrangement of the four subtrees defined by one internal edge • Only three different rearrangements per edge

  31. Nearest Neighbor Interchange (cont.)

  32. Subtree Pruning and Regrafting: Another Branch Swapping Algorithm http://artedi.ebc.uu.se/course/BioInfo-10p-2001/Phylogeny/Phylogeny-TreeSearch/SPR.gif

  33. Tree Bisection and Reconnection: Another Branch Swapping Algorithm • The most extensive swapping routine

  34. Problems with current techniques for MP • [Plot: average MP scores above optimal for the best methods at 24 hours, across 10 datasets] • The best current techniques fail to come within 0.01% of optimal at the end of 24 hours on large datasets

  35. Problems with current techniques for MP • The best methods are a combination of simulated annealing, divide-and-conquer, and genetic algorithms, as implemented in the software package TNT • However, they do not come within 0.01% of optimal on large datasets in 24 hours • [Plot: performance of TNT with time]

  36. Probabilistic modeling and molecular phylogeny

  37. Probabilistic Modeling • Probabilistic models • Describe the likelihood of outcomes • Used to predict future events • Probability distribution • Allocates likelihood to outcomes • Often represented by parameterized functions

  38. Parameter Estimation • Converse of model-based prediction • Take sampled data as given • Estimate the most likely model fitting that data set • Parameter Estimation • Construct model and constraints based on domain knowledge • Solve to find most likely values for parameters of model

  39. Likelihood Function • Define a model by its pdf f(x; θ), • where θ denotes the parameters of the pdf, constant across the sampled data • The likelihood function is L(θ) = f(x_1; θ) × f(x_2; θ) × … × f(x_n; θ), • where there are n sets of sample data • (Note: x and θ are often vectors)

  40. Example: Biased coin • n independent coin tosses, k heads • Binomial distribution with parameter p: L(p) = C(n,k) p^k (1-p)^(n-k) • Given data: 100 trials, 56 heads • Numerical solution yields the maximum at p = 0.56 • [Plot: the likelihood L as a function of p, peaking at 0.56]

  41. Log Likelihood • Take the natural log of both sides: ln L(p) = ln C(n,k) + k ln p + (n-k) ln(1-p) • Set d(ln L)/dp = 0: k/p - (n-k)/(1-p) = 0, which gives p = k/n = 56/100 = 0.56
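
A quick numerical confirmation of the coin example; the 0.001 grid step is an arbitrary choice, and the constant ln C(n,k) term is dropped because it does not affect where the maximum lies:

  #!/usr/bin/perl
  use strict;
  use warnings;

  my ($n, $k) = (100, 56);
  my ($best_p, $best_ll) = (0.5, -1e308);
  for my $i (1 .. 999) {
      my $p  = $i / 1000;
      my $ll = $k * log($p) + ($n - $k) * log(1 - $p);   # log likelihood, up to a constant
      ($best_p, $best_ll) = ($p, $ll) if $ll > $best_ll;
  }
  printf "maximum at p = %.3f (analytic k/n = %.2f)\n", $best_p, $k / $n;
  # Prints: maximum at p = 0.560 (analytic k/n = 0.56)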

  42. The Jukes-Cantor model of site evolution • Each “site” is a position in a sequence • The state (i.e., nucleotide) of each site at the root is random • The sites evolve independently and identically (i.i.d.) • If a site changes its state on an edge, it changes with equal probability to each of the other states • For every edge e, p(e) is defined: the probability of change for a random site on the edge e

  43. General Markov (GM) Model • A GM model tree is a pair (T, {M_e}) where • T is a rooted binary tree • each edge e has a stochastic substitution matrix M_e (non-negative entries, rows summing to 1) • The state at the root of T is random • GM contains models like Jukes-Cantor (JC), Kimura 2-Parameter (K2P), and the Generalized Time Reversible (GTR) models

  44. Models of Sequence Evolution • [Figure: substitution-rate diagrams among the nucleotides A, C, G, T for three models: Jukes-Cantor (a single rate a for every substitution), Kimura (separate rates for transitions and transversions), and a general model with distinct rates a through f]

  45. Variation across sites • The standard assumption about how sites vary is that each site has its own multiplicative scaling factor • Typically these scaling factors are drawn from a Gamma distribution (or Gamma plus invariant sites)

  46. Special issues • Molecular clock: the expected number of changes for a site is proportional to time • No-common-mechanism model: there is a random variable for every combination of edge and site

  47. Jukes & Cantor’s one-parameter model Assumption: • Substitutions occur with equal probabilities among the four nucleotide types.

  48. The Jukes-Cantor model (1969) We need to develop a formula for DNA evolution via Prob(y | x, t), where x and y are taken from {A, C, G, T} and t is the time length. Jukes-Cantor assumes an equal rate of change α between every pair of distinct nucleotides, giving the rate matrix:

            A     G     C     T
       A  -3α     α     α     α
       G    α   -3α     α     α
       C    α     α   -3α     α
       T    α     α     α   -3α

  49. If the nucleotide residing at a certain site in a DNA sequence is A at time 0, what is the probability, P_A(t), that this site will be occupied by A at time t? Since we start with A, P_A(0) = 1. At time 1, the probability of still having A at this site is P_A(1) = 1 - 3α, where 3α is the probability of A changing to T, C, or G, and 1 - 3α is the probability that A has remained unchanged.
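
Carrying this recurrence to arbitrary t gives the standard Jukes-Cantor closed form P_A(t) = 1/4 + (3/4)e^(-4αt), and 1/4 - (1/4)e^(-4αt) for a change to any one specific other base. A minimal Perl sketch of these formulas; the rate and time in the example calls are arbitrary:

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Prob(y | x, t) under Jukes-Cantor with per-substitution rate $alpha.
  sub jc_prob {
      my ($x, $y, $alpha, $t) = @_;
      my $e = exp(-4 * $alpha * $t);
      return $x eq $y ? 0.25 + 0.75 * $e    # no net change
                      : 0.25 - 0.25 * $e;   # change to one specific other base
  }

  printf "P(A -> A, t=10): %.4f\n", jc_prob('A', 'A', 0.01, 10);
  printf "P(A -> G, t=10): %.4f\n", jc_prob('A', 'G', 0.01, 10);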
