Sequence Similarity

Sequence Similarity

xi ― xi yj MATCH PROBCONS: Probabilistic Consistency-based Multiple Alignment of Proteins INSERT X INSERT Y ― yj

xi yj MATCH INSERT X INSERT Y xi ― ― yj A pair-HMM model of pairwise alignment • Parameterizes a probability distribution, P(A), over all possible alignments of all possible pairs of sequences • Transition probabilities ~ gap penalties • Emission probabilities ~ substitution matrix (from BLOSUM) x ABRACA-DABRA AB-ACARDI--- y

Computing Pairwise Alignments • The Viterbi algorithm • conditional distribution P(α | x, y) reflects model’s uncertainty over the “correct” alignment of x and y • identifies highest probability alignment, αviterbi, in O(L2) time Caveat: the mostlikely alignment is not the mostaccurate • Alternative: find the alignment of maximum expected accuracy P(α) αviterbi P(α | x, y)

4. F 4. T 4. F 4. F 4. F B A- A A- A 4. F 4. F 4. T 4. F 4. F B- B+ B+ B- C The Lazy-Teacher Analogy • 10 students take a 10-question true-false quiz • How do you make the answer key? • Approach #1: Use the answer sheet of the best student! • Approach #2: Weighted majority vote!

Viterbi picks single alignment with highest chance of being completely correct mathematically, finds the alignment α that maximizes Eα*[1{α = α*}] Maximum Expected Accuracy picks alignment with highest expected number of correct predictions mathematically, finds the alignment α that maximizes Eα*[accuracy(α, α*)] 4. T A Viterbi vs. Maximum Expected Accuracy (MEA) 4. F 4. F 4. T 4. F 4. F B A- A A- A 4. F 4. F 4. F 4. F 4. T C B- B+ B+ B-

Computing MEA alignments • Define accuracy (α, α*) = Eα*(accuracy(α, α*) | x, y) ~ Eα*(∑(xi, yj) in α1((xi, yj) in α*) | x,y) = ∑α’P(α’ | x, y) ∑(xi, yj) in α 1((xi, yj) in α’) = ∑(xi, yj) in α ∑α’P(α’ | x, y) 1((xi, yj) in α’) = ∑(xi, yj) in α P(xi, yj in α’ | x, y) • Define M[i, j] = posterior probability that xi is aligned to yj # of correct predicted matches length of shorter sequence

Computing MEA alignments • Define accuracy (α, α*) = • Then, MEA alignment is highest summing path through the matrix M[i, j] = P(xi is aligned to yj | x, y) • M[i, j] = posterior probability that xi is aligned to yj • Can compute with forward, backward dynamic programming in O(L2) time # of correct predicted matches length of shorter sequence

Computing MEA alignments • Define accuracy (α, α*) = • Then, MEA alignment is highest summing path through the matrix M[i, j] = P(xi is aligned to yj | x, y) • M[I, j] = posterior probability that xi is aligned to yj • Can compute with forward, backward dynamic programming in O(L2) time # of correct predicted matches length of shorter sequence

The consistency signal zk z xi x y yj yj’

To estimate P(xiyj | x, y, z) Method 1:triplet-HMM P(xi ~ yj | x, y, z) = ∑k P(xi~yj~zk | x, y, z) Parameters trained with unsupervised EM Running time: O(N3L3) N: # sequences L: sequence lengths

Probabilistic consistency • Compute P(xi is aligned to yj | x, y) P(xi is aligned to yj | x, y, z) • 2 approaches: • 1) Exact – triplet HMM, O(L3) time • 2) Approximate – use independence assumptions ∑k P(xi ~ zk and zk ~ yj | x, y, z) = ∑k P(xi ~ zk | x, z) P(zk ~ yj | x, y, z, xi ~ zk)  (assume indep.) ∑k P(xi ~ zk | x, z) P(zk ~ yj | z, y)

ABRACA-DABRA AB-ACARDI--- ABRA---DABI- ABRACA-DABRA AB-ACARDI--- ABRACADABRA ABRA--DABI- AB-ACARDI--- ABRA---DABI- Multiple sequence alignment • A straightforward generalization • sum-of-pairs • tree-based progressive alignment • iterative refinement

ABRACA-DABRA AB-ACARDI--- ABRA---DABI- ABRACA-DABRA AB-ACARDI--- ABRA---DABI- ABRACADABRA ABRA--DABI- ABRACA-DABRA AB-ACARDI--- ABRA---DABI- ABRACA-DABRA AB-ACARDI--- ABRACADABRA ABRA--DABI- AB-ACARDI--- ABRA---DABI- ABRACA-DABRA AB-ACARD--I- ABRA---DABI- ABRACA-DABRA AB-ACARDI--- ABACARDI ABRACADABRA ABACARDI ABRADABI Multiple sequence alignment • A straightforward generalization • sum-of-pairs • tree-based progressive alignment • iterative refinement

Summary of PROBCONS Algorithm • Given K sequences to be aligned, • Compute M[i, j] for all pairs of sequences, x and y • (2) Use probabilistic consistency to reestimate M[i, j] • Build a tree of the sequences by connecting closest first • “Closest” defined according to expected accuracy • EA(x, y) = E(accuracy) of MEA alignment of x and y • Perform progressive alignment along the tree • Score of a column: sum-of-pairs M[i, j] • Apply iterative refinement

Training/testing methodology • 3 reference benchmark sets • PROBCONS parameters trained via unsupervised EM on unaligned sequences from BAliBASE. • Quality score: Q(α, α*) = BAliBASE PREFAB SABmark # of correct predicted matches total # of true matches

Evaluation of Algorithm Components all-pairs pairwise multiple

Performance of different alignment tools

Resources for alignment Protein Multiple Aligners http://www.ebi.ac.uk/clustalw/ CLUSTALW – most widely used (1994) http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py MUSCLE – most scalable (2004) http://probcons.stanford.edu/ PROBCONS – most accurate (2004) Some more protein multiple aligners: MULTALIGN, MSA, DIALIGN, DCA, MACAW, TCOFFEE, MAFFT, DSC, MUSEQUAL, TOPLIGN, SACHMO, MATCHBOX, PRRN, SAM, MAXHOM, STRAP, ALIGN, AMAS, PILEUP, etc……. ProbCons: Chuong (Tom) Do

Profile hidden Markov models for sequence famillies

PFAM Protein FAMilies database of alignments • Profile HMMs describe each family • For each family in Pfam you can: • Look at multiple alignments • View protein domain architectures • Examine species distribution • Follow links to other databases • View known protein structures

PFAM Pfam-A – curated multiple alignments • Grows slowly; quality controlled by experts Pfam-B – automatic clustering (ProDom derived) • New sequences instantly incorporated; unchecked • Search by: Sequence, keyword, domain, taxonomy • Browsing by family or genome • Evolutionary tree • Source of seed alignments: • Pfam-B families • Published articles • ‘Domain hunting' studies

Profile HMMs • Each M state has a position-specific pre-computed substitution table • Each I state has position-specific gap penalties (and in principle can have its own emission distributions) • Each D state also has position-specific gap penalties • In principle, D-D transitions can also be customized per position Dm-1 Dm D1 D2 BEGIN END I0 I1 Im-1 Im M1 M2 Mm Protein Family F

Profile HMMs • transition between match states – αM(i)M(i+1) • transitions between match and insert states – αM(i)I(i), αI(i)M(i+1) • transition within insert state – αI(i)I(i) • transition between match and delete states – αM(i)D(i+1), αD(i)M(i+1) • transition within delete state – αD(i)D(i+1) • emission of amino acid b at a state S – εS(b) Dm-1 Dm D1 D2 BEGIN END I0 I1 Im-1 Im M1 M2 Mm Protein Family F

Profile HMMs • transition probabilities ~ frequency of a transition in alignment • emission probabilities ~ frequency of an emission in alignment • pseudocounts are usually introduced Dm-1 Dm D1 D2 BEGIN END I0 I1 Im-1 Im M1 M2 Mm Protein Family F

Alignment of a protein to a profile HMM To align sequence x1…xn to a profile HMM: We will find the most likely alignment with the Viterbi DP algorithm • Define • VjM(i): score of best alignment of x1…xi to the HMM ending in xi being emitted from Mj • VjI(i): score of best alignment of x1…xi to the HMM ending in xi being emitted from Ij • VjD(i): score of best alignment of x1…xi to the HMM ending in Dj (xi is the last character emitted before Dj) • Denote by qa the frequency of amino acid a in a ‘random’ protein

Alignment of a protein to a profile HMM Vj-1M(i – 1) + log αM(j-1)M(j) • VjM(i) = log (εM(j)(xi) / qxi) + max Vj-1I(i – 1) + log αI(j-1)M(j) Vj-1D(i – 1) + log αD(j-1)M(j) VjM(i – 1) + log αM(j)I(j) • VjI(i) = log (εI(j)(xi) / qxi) + max VjI(i – 1) + log αI(j)I(j) VjD(i – 1) + log αD(j)I(j) Vj-1M(i) + log αM(j-1)D(j) • VjD(i) = max Vj-1I(i) + log αI(j-1)D(j) Vj-1D(i) + log αD(j-1)D(j)

Weight of each sequence i h g f • One simple weighting scheme is to find how much edge length each leaf contributes • Example: edge 1 belongs to a • Example: edge 3 belongs both to a, and to b: e3e1/(e1+e2) goes to a Δwi = ecurrent wi / (leaves k below ecurrentwk) e d c 2 b 3 1 a

How to build a profile HMM

Resources on the web • HMMer – a free profile HMM software • http://hmmer.wustl.edu/ • SAM – another free profile HMM software • http://www.cse.ucsc.edu/research/compbio/sam.html • PFAM – database of alignments and HMMs for protein families and domains • http://www.sanger.ac.uk/Software/Pfam/ • SCOP – a structural classification of proteins • http://scop.berkeley.edu/data/scop.b.html

Sequence Similarity