Biology 224 Instructor: Tom Peavy March 13 & 18, 2008

Multiple Sequence Alignment
<Images adapted from Bioinformatics and Functional Genomics by Jonathan Pevsner>

Multiple sequence alignment: definition
• a collection of three or more protein (or nucleic acid) sequences that are partially or completely aligned
• homologous residues are aligned in columns across the length of the sequences
• residues are homologous in an evolutionary sense
• residues are homologous in a structural sense

Multiple sequence alignment: properties
• not necessarily one “correct” alignment of a protein family
• protein sequences evolve...
• ...the corresponding three-dimensional structures of proteins also evolve
• may be impossible to identify amino acid residues that align properly (structurally) throughout a multiple sequence alignment
• for two proteins sharing 30% amino acid identity, about 50% of the individual amino acids are superposable in the two structures

Multiple sequence alignment: features
• some aligned residues, such as cysteines that form disulfide bridges, may be highly conserved
• there may be conserved motifs such as a transmembrane domain
• there may be conserved secondary structure features
• there may be regions with consistent patterns of insertions or deletions (indels)

Multiple sequence alignment: methods
There are two main ways to make a multiple sequence alignment:
(1) Progressive alignment (Feng & Doolittle), e.g. ClustalW
(2) Iterative approaches

Use Clustal W to do a progressive MSA
http://www2.ebi.ac.uk/clustalw/

Feng-Doolittle MSA occurs in 3 stages
[1] Do a set of global pairwise alignments (Needleman and Wunsch)
[2] Create a guide tree
[3] Progressively align the sequences

Progressive MSA stage 1 of 3: generate global pairwise alignments
(five closely related lipocalins)
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 84
Sequences (1:3) Aligned. Score: 84
Sequences (1:4) Aligned. Score: 91
Sequences (1:5) Aligned. Score: 92
Sequences (2:3) Aligned. Score: 99   <- best score
Sequences (2:4) Aligned. Score: 86
Sequences (2:5) Aligned. Score: 85
Sequences (3:4) Aligned. Score: 85
Sequences (3:5) Aligned. Score: 84
Sequences (4:5) Aligned. Score: 96
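The global pairwise step can be illustrated with a minimal Needleman-Wunsch sketch. The scoring scheme here (match +1, mismatch -1, linear gap penalty -2) is an assumption chosen for readability. ClustalW uses substitution matrices and affine gap penalties, and the scores it reports above are on a different scale.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b with a simple (assumed) scoring scheme.

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for a[i-1] versus b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


result = needleman_wunsch("GTWYA", "GWYA")
```

Because the whole matrix is filled and the traceback starts at the last cell, every residue of both sequences appears in the result, which is what makes the alignment global rather than local.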

Number of pairwise alignments needed
For N sequences: (N-1)(N)/2
For 5 sequences: (4)(5)/2 = 10
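As a quick check of the formula, the pairs can simply be enumerated. The function name `pairwise_count` is a hypothetical helper introduced for illustration.

```python
from itertools import combinations


def pairwise_count(n: int) -> int:
    """Number of unordered pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


# Enumerate the pairs for five sequences numbered 1..5
pairs = list(combinations(range(1, 6), 2))
```

The enumeration order, (1, 2), (1, 3), ... (4, 5), matches the order in which the pairwise alignments are reported in the stage 1 output.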

Feng-Doolittle stage 2: guide tree
• Convert similarity scores to distance scores
• A tree shows the distance between objects
• Distance methods are used (e.g. neighbor joining)
• ClustalW provides a syntax to describe the tree
• A guide tree is not a phylogenetic tree
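The first bullet can be sketched using the ten scores from the stage 1 output. The conversion used here, distance = 1 - score/100, is an assumption for illustration; the exact calculation inside ClustalW may differ.

```python
# Pairwise scores from the stage 1 output (sequence numbers 1..5)
scores = {
    (1, 2): 84, (1, 3): 84, (1, 4): 91, (1, 5): 92,
    (2, 3): 99, (2, 4): 86, (2, 5): 85,
    (3, 4): 85, (3, 5): 84, (4, 5): 96,
}

# Assumed conversion: treat each score as a percentage and take its complement
distances = {pair: round(1 - s / 100, 2) for pair, s in scores.items()}

# The pair with the smallest distance is the one a distance method joins first
closest = min(distances, key=distances.get)
```

Under this conversion the closest pair is sequences 2 and 3, followed by 4 and 5, which is consistent with the order in which a guide tree would group them.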

Progressive MSA stage 2 of 3: generate guide tree
((Human RBP:0.04284,(Mouse RBP:0.00075,Rat RBP:0.00423):0.10542):0.01900,Pig RBP:0.01924,Bovine RBP:0.01902);
[Figure: guide tree of five closely related lipocalins; leaves: 1 (human RBP), 2 (murine RBP), 3 (rat RBP), 4 (porcine RBP), 5 (bovine RBP)]

Feng-Doolittle stage 3: progressive alignment
• Make a MSA based on the order in the guide tree
• Start with the two most closely related sequences
• Then add the next closest sequence
• Continue until all sequences are added to the MSA
• Rule: “once a gap, always a gap”

Clustal W alignment of 5 closely related lipocalins

CLUSTAL W (1.82) multiple sequence alignment

gi|89271|pir||A39486            MEWVWALVLLAALGSAQAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 50
gi|132403|sp|P18902|RETB_BOVIN  ------------------ERDCRVSSFRVKENFDKARFAGTWYAMAKKDP 32
gi|5803139|ref|NP_006735.1|     MKWVWALLLLAAW--AAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 48
gi|6174963|sp|Q00724|RETB_MOUS  MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
gi|132407|sp|P04916|RETB_RAT    MEWVWALVLLAALGGGSAERDCRVSSFRVKENFDKARFSGLWYAIAKKDP 50
                                                  ********************:* ***:*****

gi|89271|pir||A39486            EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 100
gi|132403|sp|P18902|RETB_BOVIN  EGLFLQDNIVAEFSVDENGHMSATAKGRVRLLNNWDVCADMVGTFTDTED 82
gi|5803139|ref|NP_006735.1|     EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED 98
gi|6174963|sp|Q00724|RETB_MOUS  EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
gi|132407|sp|P04916|RETB_RAT    EGLFLQDNIIAEFSVDEKGHMSATAKGRVRLLSNWEVCADMVGTFTDTED 100
                                *********:*******.*:************.**:**************

gi|89271|pir||A39486            PAKFKMKYWGVASFLQKGNDDHWIIDTDYDTYAAQYSCRLQNLDGTCADS 150
gi|132403|sp|P18902|RETB_BOVIN  PAKFKMKYWGVASFLQKGNDDHWIIDTDYETFAVQYSCRLLNLDGTCADS 132
gi|5803139|ref|NP_006735.1|     PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148
gi|6174963|sp|Q00724|RETB_MOUS  PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
gi|132407|sp|P04916|RETB_RAT    PAKFKMKYWGVASFLQRGNDDHWIIDTDYDTFALQYSCRLQNLDGTCADS 150
                                ****************:*******:****:*:* ****** *********
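The asterisks under each block mark columns where all five sequences carry the same residue. A minimal sketch of that part of the annotation follows. It covers identity only; the ":" and "." symbols that ClustalW assigns to similar residues are deliberately left out. The function name `conserved_columns` is a hypothetical helper, and the input is the last 12 columns of the first alignment block.

```python
def conserved_columns(alignment):
    """Return a string with '*' under every column that is identical in all rows.

    Columns containing gaps or any differing residue get a space.
    """
    marks = []
    for column in zip(*alignment):
        identical = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if identical else " ")
    return "".join(marks)


# Last 12 columns of the first block of the lipocalin alignment
fragment = [
    "SGTWYAMAKKDP",
    "AGTWYAMAKKDP",
    "SGTWYAMAKKDP",
    "SGLWYAIAKKDP",
    "SGLWYAIAKKDP",
]
marks = conserved_columns(fragment)
```

The positions left blank by this sketch (S/A, T/L, M/I) are the same ones where the ClustalW output shows ":" or a space instead of "*".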

Why “once a gap, always a gap”?
• There are many possible ways to make a MSA
• Where gaps are added is a critical question
• Gaps are often added to the first two (closest) sequences
• To change the initial gap choices later on would be to give more weight to distantly related sequences
• To maintain the initial gap choices is to trust that those gaps are most believable
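The rule can be pictured as an operation on an existing alignment: when a later sequence needs an extra column, a gap is inserted into every sequence already aligned, and gaps placed earlier are never removed. The sketch below is a toy illustration of that bookkeeping only, not ClustalW's implementation; `add_gap_columns` and its arguments are hypothetical.

```python
def add_gap_columns(alignment, positions):
    """Insert a gap column into every aligned sequence at each given column index.

    Indices refer to columns of the input alignment. Existing gaps are kept.
    """
    result = list(alignment)
    # Work from right to left so earlier insertions do not shift later indices
    for pos in sorted(positions, reverse=True):
        result = [seq[:pos] + "-" + seq[pos:] for seq in result]
    return result


# The second row already carries a gap from an earlier pairwise step
existing = ["GTWYA", "G-WYA"]
updated = add_gap_columns(existing, [3])
```

After the call, the original gap in the second row is still in place and both rows have gained the same new column, so the earlier alignment decision is preserved.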

Multiple sequence alignment to profile HMMs
• Hidden Markov models (HMMs) consist of “states” that describe the probability of having a particular amino acid residue at each position (column) of a multiple sequence alignment
• HMMs are probabilistic models
• Just as a hammer is more refined than a blast, an HMM gives more sensitive alignments than traditional techniques such as progressive alignments

An HMM is constructed from a MSA
Example: five lipocalins
GTWYA (hs RBP)
GLWYA (mus RBP)
GRWYE (apoD)
GTWYE (E. coli)
GEWFS (MUP4)

GTWYA
GLWYA
GRWYE
GTWYE
GEWFS

Prob.   1     2     3     4     5
p(G)    1.0   -     -     -     -
p(T)    -     0.4   -     -     -
p(L)    -     0.2   -     -     -
p(R)    -     0.2   -     -     -
p(E)    -     0.2   -     -     0.4
p(W)    -     -     1.0   -     -
p(Y)    -     -     -     0.8   -
p(F)    -     -     -     0.2   -
p(A)    -     -     -     -     0.4
p(S)    -     -     -     -     0.2

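GTWYA
GLWYA
GRWYE
GTWYE
GEWFS

P(GEWYE) = (1.0)(0.2)(1.0)(0.8)(0.4) = 0.064
log odds score = ln(1.0) + ln(0.2) + ln(1.0) + ln(0.8) + ln(0.4) = -2.75

[Figure: HMM states with emission probabilities]
Position 1: G:1.0
Position 2: T:0.4 L:0.2 R:0.2 E:0.2
Position 3: W:1.0
Position 4: Y:0.8 F:0.2
Position 5: E:0.4 A:0.4 S:0.2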
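The column probabilities and the score for GEWYE can be reproduced with a short sketch. It uses raw observed frequencies only, as in the example above. Full profile HMMs also model insert and delete states and typically add pseudocounts, both of which are omitted here. The function names are hypothetical.

```python
import math
from collections import Counter

sequences = ["GTWYA", "GLWYA", "GRWYE", "GTWYE", "GEWFS"]


def column_probabilities(alignment):
    """For each column, map every observed residue to its relative frequency."""
    n = len(alignment)
    return [
        {residue: count / n for residue, count in Counter(column).items()}
        for column in zip(*alignment)
    ]


def sequence_probability(profile, seq):
    """Product of the per-column probabilities (0.0 for an unseen residue)."""
    p = 1.0
    for column, residue in zip(profile, seq):
        p *= column.get(residue, 0.0)
    return p


def log_score(profile, seq):
    """Sum of natural logs of the per-column probabilities.

    Raises KeyError for a residue never observed in a column, since no
    pseudocounts are used in this sketch.
    """
    return sum(math.log(column[residue]) for column, residue in zip(profile, seq))


profile = column_probabilities(sequences)
p_gewye = sequence_probability(profile, "GEWYE")
score_gewye = log_score(profile, "GEWYE")
```

Summing logarithms gives the same ranking as multiplying probabilities while avoiding very small numbers for longer sequences.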

Databases of multiple sequence alignments
BLOCKS (HMM)
CDD (HMM)
DOMO (Gapped MSA)
INTERPRO
iProClass
MetaFAM
Pfam (profile HMM library)
PRINTS
PRODOM (PSI-BLAST)
PROSITE
SMART

CDD uses RPS-BLAST: reverse position-specific BLAST
• Query = your favorite protein
• Database = set of many PSSMs
• CDD is related to PSI-BLAST, but distinct
• CDD searches against profiles generated from pre-selected alignments
• Purpose: to find conserved domains in the query sequence
• You can access CDD via DART at NCBI

Multiple sequence alignment algorithms
              Local      Global
Progressive   PIMA       CLUSTAL, PileUp, other
Iterative     DIALIGN    SAGA

Multiple sequence alignment programs
AMAS, CINEMA, ClustalW, ClustalX, DIALIGN, HMMT, Match-Box, MultAlin, MSA, Musca, PileUp, SAGA, T-COFFEE

Clustal X

GCG PileUp

Assessment of alternative multiple sequence alignment algorithms
[1] As percent identity among proteins drops, performance (accuracy) also declines. This is especially severe for proteins with < 25% identity.
    Proteins < 25% identity: 65% of residues align well
    Proteins < 40% identity: 80% of residues align well
[2] “Orphan” sequences are highly divergent members of a family. Surprisingly, orphans do not disrupt alignments. Also surprisingly, global alignment algorithms outperform local ones.
