Inferring phylogenetic trees

Prof. William Stafford Noble
Department of Genome Sciences
Department of Computer Science and Engineering
University of Washington
thabangh@gmail.com

One-minute responses
• I did not understand anything in the Gibbs sampling and the second method.
• The class was quite OK now. Understood most important things.
• I understood 50% of the Python part. But I am a bit confused about the goal of the programs.
• Please send us the slides immediately after lecture.
  • I put the slides on the website during the Python half of the class. Hit “refresh” on the web browser to see them.
• I didn’t understand clearly converting scores to p-values, more especially putting 1 and 2. Otherwise everything was clear.
• I think we should go a little bit slower.
• I didn’t understand the EM and Gibbs.
• The concepts of EM and Gibbs sampling are really very important. Please go in depth on them.
• Python sessions are still fine as usual.
• These algorithms are complex. Could you please explain them with some examples?
• I didn’t understand the second Python problem.
• Emile must not mark our assessment on the programming part.

Revision - Gibbs
• Randomly select motif occurrences from the sequences
• Randomly discard one sequence
• Build PSSM from remaining sequences
  • Counts
  • Add pseudocounts
  • Normalize
• Scan discarded sequence with PSSM
• Choose new occurrence according to resulting probabilities
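The Gibbs steps on this slide can be sketched in Python as below. This is a minimal illustration, not the code used in class: the names `frequency_matrix` and `gibbs_step`, the pseudocount of 1, and the toy sequences are all assumptions, and the sampling weights use the motif probabilities alone, which matches dividing by a uniform background.

```python
import random

ALPHABET = "ACGT"


def frequency_matrix(occurrences, pseudocount=1.0):
    """Counts -> add pseudocounts -> normalize, one dict per motif column."""
    width = len(occurrences[0])
    matrix = []
    for col in range(width):
        counts = {letter: pseudocount for letter in ALPHABET}
        for occurrence in occurrences:
            counts[occurrence[col]] += 1
        total = sum(counts.values())
        matrix.append({letter: counts[letter] / total for letter in ALPHABET})
    return matrix


def gibbs_step(sequences, starts, width, rng):
    """Resample the motif start of one randomly discarded sequence, in place."""
    held_out = rng.randrange(len(sequences))
    # Build the matrix from the occurrences in all the other sequences.
    occurrences = [
        seq[start:start + width]
        for index, (seq, start) in enumerate(zip(sequences, starts))
        if index != held_out
    ]
    matrix = frequency_matrix(occurrences)
    # Scan the discarded sequence: one weight per candidate start position.
    target = sequences[held_out]
    weights = []
    for start in range(len(target) - width + 1):
        weight = 1.0
        for offset in range(width):
            weight *= matrix[offset][target[start + offset]]
        weights.append(weight)
    # Choose the new occurrence with probability proportional to its weight.
    starts[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return held_out


# Hypothetical toy data: each sequence contains one copy of ACG.
rng = random.Random(0)
sequences = ["TTACGTT", "ACGTTTT", "TTTTACG", "GGACGGG"]
width = 3
starts = [rng.randrange(len(seq) - width + 1) for seq in sequences]
for _ in range(200):
    gibbs_step(sequences, starts, width, rng)
print(starts)
```

Because the new position is sampled rather than taken as the maximum, the start positions can still move after the motif has been found; that is the difference from the EM revision slide.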

Revision - EM
• Randomly select motif occurrences from the sequences
• Build PSSM
  • Counts
  • Add pseudocounts
  • Normalize
  • Divide by background
  • Take log2
• Scan each sequence with PSSM
• Take top-scoring occurrence
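The PSSM construction and the "take top-scoring occurrence" step might look like this in Python. It is a sketch under stated assumptions: the function names `build_pssm` and `best_occurrence`, the pseudocount of 1, and the uniform background are illustrative choices, not taken from the lecture.

```python
import math


def build_pssm(occurrences, background=None, pseudocount=1.0, alphabet="ACGT"):
    """Counts -> pseudocounts -> normalize -> divide by background -> log2."""
    if background is None:
        # Assumption: uniform background frequencies.
        background = {letter: 1.0 / len(alphabet) for letter in alphabet}
    width = len(occurrences[0])
    pssm = []
    for col in range(width):
        counts = {letter: pseudocount for letter in alphabet}
        for occurrence in occurrences:
            counts[occurrence[col]] += 1
        total = sum(counts.values())
        pssm.append({
            letter: math.log2((counts[letter] / total) / background[letter])
            for letter in alphabet
        })
    return pssm


def best_occurrence(sequence, pssm):
    """Scan one sequence and return (start, score) of the top-scoring window."""
    width = len(pssm)
    scores = [
        sum(pssm[offset][sequence[start + offset]] for offset in range(width))
        for start in range(len(sequence) - width + 1)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical motif occurrences and a sequence to scan.
pssm = build_pssm(["ACG", "ACG", "ACT"])
print(best_occurrence("TTACGTT", pssm))
```

Repeating the two functions in a loop, rebuilding the PSSM from the top-scoring occurrence in every sequence, gives the cycle drawn on the slide.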

Phylogenetic inference
[Figure: Rabbit, Dove, Lion, Donkey, ?]

Outline
• Parsimony
• Distance methods
  • Computing distances
  • Finding the tree
• Maximum likelihood

Selecting a method
• Choose set of related sequences
• Obtain multiple sequence alignment
• Is there strong sequence similarity?
  • Yes: Maximum parsimony methods
  • No: Is there clearly recognizable sequence similarity?
    • Yes: Distance methods
    • No: Maximum likelihood methods

Maximum parsimony
  for each possible tree                 (enumerating these trees can take a very long time)
    compute the parsimony score          (computing this score is straightforward)
  return the tree with the best score

How many trees?
• With four sequences: 3 unrooted trees.
• With five sequences: 15 unrooted trees.
• With seven sequences: 945 unrooted trees.
[Figure: the three unrooted trees for four sequences labeled 1 to 4]
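These counts follow the standard formula for unrooted bifurcating trees on n labeled leaves, (2n-5)!! = 3 x 5 x ... x (2n-5). The short sketch below (the function name `count_unrooted_trees` is just an illustrative choice) shows how quickly the number grows.

```python
def count_unrooted_trees(n_leaves: int) -> int:
    """Number of unrooted bifurcating trees on n labeled leaves, (2n-5)!!."""
    if n_leaves < 3:
        raise ValueError("need at least three leaves")
    count = 1  # three leaves admit exactly one unrooted tree
    # A tree with k-1 leaves has 2k-5 branches, and the k-th leaf
    # can be attached to any one of them.
    for k in range(4, n_leaves + 1):
        count *= 2 * k - 5
    return count


for n in (4, 5, 7, 10, 20):
    print(n, count_unrooted_trees(n))
```

The growth is the reason exhaustive parsimony search is limited to small numbers of sequences.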

Computing parsimony scores
Site: Scer = A, Smik = A, Spar = G, Skud = A

Computing parsimony scores
Site: Scer = A, Smik = A, Spar = G, Skud = A
[Figure: first tree, with both internal nodes labeled A]
Score = 1

Computing parsimony scores
Site: Scer = A, Smik = A, Spar = G, Skud = A
[Figure: all three trees, with both internal nodes labeled A in each]
Score = 1 for each of the three trees.
This site is uninformative, because all the trees have the same score.

Computing parsimony scores
[Figure: the three trees with blank leaf states for Scer, Smik, Spar, and Skud; Score = ? for each]

Computing parsimony scores
Site: Scer = G, Smik = A, Spar = G, Skud = T
[Figure: the three trees with inferred internal states]
Score = 2 for each of the three trees.

Computing parsimony scores
[Figure: the three trees with blank leaf states for Scer, Smik, Spar, and Skud; Score = ? for each]

Computing parsimony scores
Site: Scer = A, Smik = T, Spar = A, Skud = T
[Figure: the three trees with inferred internal states]
Scores: 2, 1, and 2.
The tree with score 1, which pairs Scer with Spar and Smik with Skud, is best.
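The per-site scores on these slides can be reproduced with Fitch's small-parsimony algorithm, which the slides do not name; it is swapped in here as one standard way to count changes. The sketch assumes each unrooted four-species tree is written as a pair of pairs (rooted on its internal branch, which does not change the score), and the names `fitch` and `site_scores` are illustrative.

```python
def fitch(tree, states):
    """Return (possible ancestral states, number of changes) for a subtree.

    `tree` is either a leaf name or a (left, right) tuple of subtrees.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    shared = left_set & right_set
    if shared:
        # The children agree on at least one state: no change needed here.
        return shared, left_changes + right_changes
    # The children disagree: one change is required at this node.
    return left_set | right_set, left_changes + right_changes + 1


# The three unrooted trees for four species.
QUARTET_TREES = {
    "(Scer,Smik),(Spar,Skud)": (("Scer", "Smik"), ("Spar", "Skud")),
    "(Scer,Spar),(Smik,Skud)": (("Scer", "Spar"), ("Smik", "Skud")),
    "(Scer,Skud),(Smik,Spar)": (("Scer", "Skud"), ("Smik", "Spar")),
}


def site_scores(states):
    """Parsimony score of one alignment column on each of the three trees."""
    return {name: fitch(tree, states)[1] for name, tree in QUARTET_TREES.items()}


# The three example sites from the slides.
for site in ("AAGA", "GAGT", "ATAT"):
    states = dict(zip(("Scer", "Smik", "Spar", "Skud"), site))
    print(site, site_scores(states))
```

Summing `site_scores` over every column of an alignment gives a total score per tree, and the tree with the smallest total is the most parsimonious.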

Computing parsimony scores
[Figure: tree with leaves Scer, Smik, Spar, Skud]
Total = 26

Computing parsimony scores
[Figure: tree with leaves Scer, Spar, Smik, Skud]
Total = 28

Parsimony software
• In general, the most widely used programs for phylogenetic analysis are
  • Phylip (Joe Felsenstein)
  • PAUP (Jim Swofford)
  • MacClade (David and Wayne Maddison)
• All three do parsimony. Only Phylip is free.

Previous one-minute responses
• How many sequences are usually analyzed by parsimony methods?
  • Exhaustively, probably tens of sequences. With heuristic search methods, you can analyze arbitrarily many, but you lose the guarantee that you’re finding the most parsimonious tree.
• What do good parsimony scores look like?
  • It depends upon how many sequences are involved, and how divergent they are.
• Why doesn’t the parsimony method take into account transitions versus transversions?
  • It can; I presented the simplest version.

Jukes-Cantor model
• Assume the same probability of change at all positions and all times.
• d_AB is the proportion of changed sites in the alignment.
• K_AB is the distance between sequences A and B.
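The equation relating the two quantities did not survive in this transcript. The standard Jukes-Cantor correction, written with the proportion of changed sites and the distance defined above, is given here on the assumption that it is the form the slide showed; it does reproduce the 0.823959 printed in Problem #1.

```latex
K_{AB} = -\frac{3}{4}\,\ln\!\left(1 - \frac{4}{3}\,d_{AB}\right)
```

For example, if two of four aligned sites differ, then d_AB = 0.5 and K_AB = -(3/4) ln(1/3), which is approximately 0.824. The formula is only defined for d_AB below 3/4.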

Problem #1
• Write a program jukes-cantor.py that takes as input a pairwise sequence alignment and prints the Jukes-Cantor distance. Skip sites that contain gaps.
  > cat twoseqs.txt
  ACGT
  ACCG
  > python jukes-cantor.py twoseqs.txt
  0.823959
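One possible solution sketch follows. It assumes the standard Jukes-Cantor formula K = -(3/4) ln(1 - (4/3) d), that gaps are written as "-", and that the input file holds one aligned sequence per line as in the `cat` output; the function names `jukes_cantor` and `read_alignment` are illustrative, not part of the assignment.

```python
import math


def jukes_cantor(seq_a, seq_b, gap="-"):
    """Jukes-Cantor distance between two aligned sequences, skipping gapped sites."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # skip sites that contain gaps
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped sites to compare")
    if differing == 0:
        return 0.0
    d = differing / compared  # proportion of changed sites
    if d >= 0.75:
        return math.inf  # the correction is undefined at or beyond saturation
    return -0.75 * math.log(1.0 - 4.0 * d / 3.0)


def read_alignment(path):
    """Read one aligned sequence per non-empty line (assumed file format)."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


# The two sequences from twoseqs.txt.
print(f"{jukes_cantor('ACGT', 'ACCG'):.6f}")  # prints 0.823959
```

To turn this into the command-line program, pass the two sequences returned by `read_alignment(sys.argv[1])` to `jukes_cantor`.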

Problem #2
• Generalize your previous program to work for a multiple sequence alignment.
  > cat threeseqs.txt
  ACGT
  ACTG
  ACGG
  > python jukes-cantor-matrix.py threeseqs.txt
  0.000 0.824 0.304
  0.824 0.000 0.304
  0.304 0.304 0.000
  > jukes-cantor-multiple.py moreseqs.txt
  0.000 0.233 0.383 0.233
  0.233 0.000 0.824 0.572
  0.383 0.824 0.000 0.107
  0.233 0.572 0.107 0.000
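A sketch of the generalization, under the same assumptions as before (standard Jukes-Cantor formula, "-" as the gap character, illustrative function names): compute the pairwise distance for every pair of rows and print the resulting square matrix.

```python
import math


def jukes_cantor(seq_a, seq_b, gap="-"):
    """Jukes-Cantor distance between two aligned sequences, skipping gapped sites."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no ungapped sites to compare")
    differing = sum(a != b for a, b in pairs)
    if differing == 0:
        return 0.0
    d = differing / len(pairs)
    if d >= 0.75:
        return math.inf  # the correction is undefined at or beyond saturation
    return -0.75 * math.log(1.0 - 4.0 * d / 3.0)


def distance_matrix(sequences):
    """Square matrix of pairwise distances; gaps are skipped per pair."""
    return [[jukes_cantor(a, b) for b in sequences] for a in sequences]


def format_matrix(matrix):
    """One row per line, three decimal places, space-separated."""
    return "\n".join(" ".join(f"{value:.3f}" for value in row) for row in matrix)


# The three sequences from threeseqs.txt.
print(format_matrix(distance_matrix(["ACGT", "ACTG", "ACGG"])))
```

Run on the three sequences from threeseqs.txt, this prints the same 3 x 3 matrix as the slide. The matrix is symmetric, so a larger program could compute only the upper triangle and mirror it.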
