Phylogenetic Analysis

Phylogenetic Analysis YTSLLLSRQ- YASLLW-RQA PASIILSRQA GRSIVLTRQM

Phylogenetics – What do I need to do?
• Get related sequences of interest
• Perform multiple sequence alignments
• Edit alignment
• Estimate phylogenetic relationships
• Interpret results correctly

So you have a sequence… now what?
MKILLLCIIFLYYVNAFKNTQKDGVSLQILKKKRSNQVNFLNRKNDYNLIKNKNPSSSLKSTFDDIKKIISKQLSVEEDKIQMNSNFTKDLGADSLDLVELIMALEEKFNVTISDQDALKINTVQDAIDYIEKNNKQ

#1: What is it?
Does the source organism have its own genome database?
• Yes: BLAST at the genome database (GeneDB, PlasmoDB, etc.)
• Unknown/No: BLAST at PubMed

Why start with a genome-specific database?
• Genome location/structure
• Strain variability
• BLAST
• Expression data
• Pathway data

PubMed BLAST

Blastp PubMed BLAST

Protein families – Conserved Domains

BLAST Hits

Downloading sequences – FASTA format

Getting sequences – FASTA format

Saving and editing FASTA files
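As a rough illustration of what a FASTA file contains, the sketch below reads and writes the format: a header line starting with ">" followed by one or more sequence lines. The record name `query_protein` and the 60-column line width are assumptions made for the example; the sequence is the example protein shown earlier.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}; headers start with '>'."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # a sequence may span several lines
    return records


def format_fasta(records: Dict[str, str], width: int = 60) -> str:
    """Write records as FASTA text, wrapping sequences at `width` columns."""
    out = []
    for header, seq in records.items():
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


# Hypothetical record name; the sequence is the example protein from the slides.
example = {
    "query_protein": (
        "MKILLLCIIFLYYVNAFKNTQKDGVSLQILKKKRSNQVNFLNRKNDYNLIKNKNPSSSLKSTFDDIKKI"
        "ISKQLSVEEDKIQMNSNFTKDLGADSLDLVELIMALEEKFNVTISDQDALKINTVQDAIDYIEKNNKQ"
    )
}
fasta_text = format_fasta(example)
```

Parsing `fasta_text` again with `parse_fasta(fasta_text.splitlines())` should give back the same record.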

Pair-wise sequence alignment
Global:
GYTSLLLSRQNED--G
G--SLLLSHK-D-HTG
Overlap:
GYTSLLLSRQNEDG--
--GSLLLSHK-D-HTG
Local (Smith-Waterman):
TSLLLSR
TSLLLSH
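Local alignment with Smith-Waterman can be sketched as below. The scoring values (match +2, mismatch -1, gap -2) are assumptions chosen for the example; a real run would normally use a substitution matrix such as BLOSUM or PAM, so the result need not match the slide exactly.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Scores are floored at 0, which is what makes the alignment local.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# The two sequences from the slide, with gap characters removed.
score, top, bottom = smith_waterman("GYTSLLLSRQNEDG", "GSLLLSHKDHTG")
```

With these assumed scores the best local match is the shared core SLLLS; a different scoring scheme can extend or shorten the aligned region.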

Aligning 2 sequences globally
Sequences: YTSLLLSRQ and YASLLWRQA
[Figure: dynamic-programming score matrix for the two sequences; the first row and first column hold cumulative gap penalties -4, -8, …, -36]
Resulting alignment:
YTSLLLSRQ-
YASLLW-RQA
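The matrix-filling procedure for a global alignment (Needleman-Wunsch) can be sketched as follows. The gap penalty of -4 follows the first row of the slide's matrix; the match (+4) and mismatch (-1) scores are assumptions, since the slide's substitution scores cannot be recovered.

```python
def needleman_wunsch(a, b, match=4, mismatch=-1, gap=-4):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    # First row and column: cumulative gap penalties (-4, -8, ...).
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)

    # Trace back from the bottom-right corner to the top-left corner.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return F[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("YTSLLLSRQ", "YASLLWRQA")
```

Several alignments can share the optimal score, so which one the traceback reports depends on how ties are broken.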

Multiple sequence alignment – Progressive
• Align the 2 closest sequences
• Add in the next closest sequence
• Continue adding…
Highly dependent on the initial matches.
Step 1:
YTSLLLSRQ-
YASLLW-RQA
Step 2:
YTSLLLSRQ-
YASLLW-RQA
PASIILSRQA
Step 3:
YTSLLLSRQ-
YASLLW-RQA
PASIILSRQA
GRSIVLTRQM
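The order in which a progressive method adds sequences can be sketched as below. For brevity, similarity is measured as percent identity on the already-aligned rows from the slide; this is an assumption for illustration, since a real program would derive the order from pairwise alignments. The names `seq1` to `seq4` are hypothetical labels.

```python
from itertools import combinations


def identity(x, y):
    """Fraction of identical residues over columns where neither row has a gap."""
    pairs = [(p, q) for p, q in zip(x, y) if p != "-" and q != "-"]
    return sum(p == q for p, q in pairs) / len(pairs)


def join_order(rows):
    """Start with the closest pair, then keep adding the next closest sequence."""
    names = list(rows)
    sim = {
        frozenset(pair): identity(rows[pair[0]], rows[pair[1]])
        for pair in combinations(names, 2)
    }
    first = max(combinations(names, 2), key=lambda pair: sim[frozenset(pair)])
    order = list(first)
    remaining = [n for n in names if n not in order]
    while remaining:
        # Next closest = highest mean identity to the sequences already placed.
        nxt = max(
            remaining,
            key=lambda n: sum(sim[frozenset((n, m))] for m in order) / len(order),
        )
        order.append(nxt)
        remaining.remove(nxt)
    return order


rows = {
    "seq1": "YTSLLLSRQ-",
    "seq2": "YASLLW-RQA",
    "seq3": "PASIILSRQA",
    "seq4": "GRSIVLTRQM",
}
order = join_order(rows)
```

Because each step builds on the previous one, an early mismatch in the first pair carries through to every later sequence.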

Multiple sequence alignment – Iterative
• Start from an initial MSA with a low score
• Optimize the MSA score
Probabilistic methods don’t always generate the same answer.
Initial MSA:
YTTSLLLSRQ--
YATSLLWRQA--
PASIILSRQA--
GRTSIVLTRQMA
Optimized MSA:
YTTSLLLSRQ--
YATSLLW-RQ-A
PA-SIILSRQ-A
GRTSIVLTRQMA
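One common way to score an MSA is the sum-of-pairs score, sketched below for the two alignments on this slide. The scoring values (identical residues +1, different residues 0, residue against gap -1, gap against gap 0) are assumptions for the example, not the scores of any particular program.

```python
from itertools import combinations


def pair_score(p, q):
    """Score one pair of symbols in a column (assumed scoring scheme)."""
    if p == "-" and q == "-":
        return 0
    if p == "-" or q == "-":
        return -1
    return 1 if p == q else 0


def sum_of_pairs(msa):
    """Sum the pair scores over every column and every pair of rows."""
    total = 0
    for column in zip(*msa):
        total += sum(pair_score(p, q) for p, q in combinations(column, 2))
    return total


initial_msa = [
    "YTTSLLLSRQ--",
    "YATSLLWRQA--",
    "PASIILSRQA--",
    "GRTSIVLTRQMA",
]
optimized_msa = [
    "YTTSLLLSRQ--",
    "YATSLLW-RQ-A",
    "PA-SIILSRQ-A",
    "GRTSIVLTRQMA",
]
initial_score = sum_of_pairs(initial_msa)
optimized_score = sum_of_pairs(optimized_msa)
```

An iterative method repeatedly proposes changes to the alignment and keeps those that raise a score of this kind.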

Multiple sequence alignment programs
Classified by pair-wise alignment type (global or local) and by MSA alignment type:
• Progressive: ClustalX, T-Coffee, POA
• Iterative: HMMs, GAs, Dialign

Multiple Sequence Alignments
• POAVIZ – progressive local
• CLUSTAL – progressive global

POAVIZ

CLUSTALX Parameters

CLUSTALX

CLUSTALX – Protein Weight Matrices
• 1) BLOSUM (Henikoff). These matrices appear to be the best available for carrying out database similarity (homology) searches.
• 2) PAM (Dayhoff). These have been extremely widely used since the late '70s.
• 3) GONNET. These matrices were derived using almost the same procedure as the Dayhoff one (above) but are much more up to date and are based on a far larger dataset.

BLOSUM (BLOck SUbstitution Matrix)
BLOSUM99 (99% identity threshold) → BLOSUM62 (62% identity threshold)
• BLOSUM62: within aligned blocks, sequences that share at least 62% identity are clustered and counted as one, and the observed substitution frequencies are then tallied
• Pros: best bet for distantly divergent sequences

PAM (point accepted mutation)
PAM1 (99% identity) → PAM250 (20% identity)
• PAM number = number of accepted point mutations per 100 amino acids
• Gather the substitution rates for PAM1 (99% identical sequences)
• Higher PAM matrices are extrapolated from PAM1, assuming that those substitution rates are consistent over time
• Pros: very good for closely related sequences
• Cons: rare mutations are under-represented; substitution rates are not constant over time (both are problems for phylogenetic estimation)
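The extrapolation from PAM1 to PAM250 amounts to multiplying the PAM1 mutation-probability matrix by itself 250 times. The sketch below shows this on a toy two-letter alphabet; the 2×2 matrix is a made-up stand-in for the real 20×20 Dayhoff matrix and only illustrates the arithmetic.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [
        [sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_pow(M, power):
    """Raise a square matrix to a positive integer power by repeated squaring."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in M]
    while power > 0:
        if power % 2 == 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power //= 2
    return result


# Toy "PAM1": each residue has a 99% chance of staying unchanged per PAM unit.
toy_pam1 = [
    [0.99, 0.01],
    [0.01, 0.99],
]
toy_pam250 = mat_pow(toy_pam1, 250)
```

In this toy case the chance of seeing the original residue after 250 steps falls to roughly one half, the background level for two letters, which illustrates why long extrapolations lean heavily on the constant-rate assumption.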

CLUSTALX

CLUSTALX - Aligning

CLUSTALX – Alignment view

CLUSTAL vs POAVIZ (global vs local) POAVIZ CLUSTAL

BioEdit – Alignment manipulation
• Open the “.aln” file

BioEdit – Alignment manipulation
• “Back colored view” gives more contrast
• Select “Edit” from the mode dropdown

BioEdit – Alignment manipulation
• Select “Insert” so that you don’t accidentally lose part of your sequence
• Then select the unaligned beginning (or end) sequence and delete it

BioEdit – Alignment manipulation
• Now save as a different file in FASTA format (.fasta)
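The same end-trimming can be done in a script. The sketch below removes leading and trailing columns in which any sequence has a gap; that rule is only one possible criterion and is an assumption here, as are the names `seq1` to `seq4`. Internal gaps are left untouched.

```python
def trim_gapped_ends(rows):
    """Drop leading/trailing alignment columns where any row has a gap."""
    seqs = list(rows.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")

    def complete(col):
        # True when no sequence has a gap in this column.
        return all(s[col] != "-" for s in seqs)

    start = 0
    while start < length and not complete(start):
        start += 1
    end = length
    while end > start and not complete(end - 1):
        end -= 1
    return {name: s[start:end] for name, s in rows.items()}


aligned = {
    "seq1": "YTSLLLSRQ-",
    "seq2": "YASLLW-RQA",
    "seq3": "PASIILSRQA",
    "seq4": "GRSIVLTRQM",
}
trimmed = trim_gapped_ends(aligned)
```

Here only the last column is removed, because it is the only end column in which a sequence has a gap.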

Tree terminology
• Root
• Outgroup
• Common ancestor (node, branch point)
• Lineage (branch, edge)
• Branch length
• Operational taxonomic units (OTUs, leaves): B, C, D, E, F, G, A

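[Figure: three tree topologies over taxa A to G, with leaf orders B C D E F G A (Topology 1), B C E F G D A (Topology 2), and E F G C D B A (Topology 3); groups are labeled monophyletic, paraphyletic, and polyphyletic]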

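Sequence homology – orthologues and paralogues
• An ancestral gene duplication produces genes A and B in the last common ancestor
• Speciation then yields Human A, Rat A, Human B, and Rat B
• Orthologues (separated by speciation): Human A and Rat A; Human B and Rat B
• Paralogues (separated by duplication): A genes versus B genes, e.g. Human A and Human B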

Methods of estimating phylogenetic relationships
• Character-based: Maximum Parsimony (MP)
• Distance-based: Neighbor-Joining (NJ), Minimum Evolution (ME)
• Probabilistic: Maximum likelihood (ML), Bayesian inference

Methods of estimating phylogenetic relationships – Maximum Parsimony (MP)
Taxa1 AAG
Taxa2 AAA
Taxa3 GGA
Taxa4 AGA
• ((AAG, AAA), (GGA, AGA)): 3 changes required (best tree)
• ((AAG, AGA), (AAA, GGA)): 4 changes required
• ((AAG, GGA), (AAA, AGA)): 4 changes required
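The change counts for the three trees can be reproduced with Fitch's small-parsimony algorithm, sketched below. The slides do not name the counting method, so using Fitch here is an assumption; trees are written as nested tuples of taxon names.

```python
def fitch_site(tree, states):
    """Return (possible_states, changes) for one alignment column."""
    if isinstance(tree, str):  # a leaf: its observed state, no changes
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_site(left, states)
    right_set, right_changes = fitch_site(right, states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra change needed.
        return common, left_changes + right_changes
    # Children disagree: take the union and count one change.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, seqs):
    """Total number of changes over all columns of the alignment."""
    n_sites = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: s[k] for name, s in seqs.items()})[1]
        for k in range(n_sites)
    )


seqs = {"Taxa1": "AAG", "Taxa2": "AAA", "Taxa3": "GGA", "Taxa4": "AGA"}
trees = [
    (("Taxa1", "Taxa2"), ("Taxa3", "Taxa4")),
    (("Taxa1", "Taxa4"), ("Taxa2", "Taxa3")),
    (("Taxa1", "Taxa3"), ("Taxa2", "Taxa4")),
]
scores = [parsimony_score(t, seqs) for t in trees]
```

The tree with the lowest score is the most parsimonious; for four taxa these three unrooted topologies are the only ones to compare.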

Methods of estimating phylogenetic relationships – Distance-based
• Neighbor-Joining (NJ) method: clusters neighboring species that are joined by one node. It does not evaluate all the possible tree topologies, so it is not guaranteed to obtain the optimal tree.
• Minimum Evolution (ME) method: exhaustively estimates the total branch length of each topology, then chooses the topology with the least total branch length. Time intensive for large numbers of taxa.
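A single neighbor-joining step, choosing which two taxa to join first, can be sketched with the standard Q criterion. The five taxa `a` to `e` and their distances are hypothetical values chosen for illustration, not data from the slides.

```python
from itertools import combinations


def nj_pick_pair(names, dist):
    """Return (Q, x, y) for the pair of taxa with the smallest Q value."""
    n = len(names)
    # Sum of distances from each taxon to all the others.
    total = {
        x: sum(dist[frozenset((x, y))] for y in names if y != x) for x in names
    }
    best = None
    for x, y in combinations(names, 2):
        q = (n - 2) * dist[frozenset((x, y))] - total[x] - total[y]
        if best is None or q < best[0]:
            best = (q, x, y)
    return best


names = ["a", "b", "c", "d", "e"]
# Hypothetical pairwise distances between the five taxa.
dist = {
    frozenset(("a", "b")): 5, frozenset(("a", "c")): 9,
    frozenset(("a", "d")): 9, frozenset(("a", "e")): 8,
    frozenset(("b", "c")): 10, frozenset(("b", "d")): 10,
    frozenset(("b", "e")): 9, frozenset(("c", "d")): 8,
    frozenset(("c", "e")): 7, frozenset(("d", "e")): 3,
}
q_value, first, second = nj_pick_pair(names, dist)
```

Note that the pair with the smallest raw distance (d and e) is not the pair chosen: the Q criterion also accounts for how far each taxon is from everything else. A full NJ run would replace the joined pair with a new node and repeat.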

Methods of estimating phylogenetic relationships – Probabilistic methods
Maximum likelihood (ML)
• Prob ( data | model + tree )
• Search all possible topologies to optimize the probability
• The more likely topology is found
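As a small illustration of Prob ( data | model + tree ), the sketch below computes the likelihood of two aligned DNA sequences for a given branch length and finds the branch length that maximizes it. The Jukes-Cantor (JC69) model is an assumption, since the slides do not name a model, and both sequences are hypothetical. With only two sequences there is a single topology, so the search is over one branch length rather than over trees.

```python
import math


def jc69_log_likelihood(seq1, seq2, d):
    """Log Prob(data | JC69, branch length d), d in substitutions per site."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability a site shows the same base
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    log_l = 0.0
    for x, y in zip(seq1, seq2):
        # 0.25 is the equilibrium frequency of the base in the first sequence.
        log_l += math.log(0.25 * (p_same if x == y else p_diff))
    return log_l


def ml_distance(seq1, seq2, step=0.001, max_d=3.0):
    """Grid search for the branch length with the highest likelihood."""
    grid = [k * step for k in range(1, int(max_d / step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(seq1, seq2, d))


# Hypothetical aligned DNA sequences that differ at 3 of 20 sites.
seq_a = "ACGTACGTACGTACGTACGT"
seq_b = "ACGTACGAACGTTCGTACGC"
best_d = ml_distance(seq_a, seq_b)
```

For real data the same idea is applied to whole trees: each candidate topology gets its branch lengths optimized, and the topology with the highest likelihood is kept.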

Bayesian inference
• Prior information
• Model for selection
Need both for everyone in the class
