380 likes | 493 Vues
The PROBCONS algorithm applies a probabilistic consistency-based approach for multiple sequence alignments. It utilizes pair-HMM models to parameterize alignments, capturing transition probabilities like gap penalties and emission probabilities based on substitution matrices from BLOSUM. By employing dynamic programming, it computes alignments that maximize expected accuracy instead of just maximum likelihood. The method iteratively refines alignment accuracy, leveraging consensus across multiple sequences. Key features include using the Viterbi algorithm and integrating a tree-based structure for progressive alignment, enhancing the reliability of protein alignments.
E N D
xi ― xi yj MATCH PROBCONS: Probabilistic Consistency-based Multiple Alignment of Proteins INSERT X INSERT Y ― yj
xi yj MATCH INSERT X INSERT Y xi ― ― yj A pair-HMM model of pairwise alignment • Parameterizes a probability distribution, P(A), over all possible alignments of all possible pairs of sequences • Transition probabilities ~ gap penalties • Emission probabilities ~ substitution matrix (from BLOSUM) x ABRACA-DABRA AB-ACARDI--- y
Computing Pairwise Alignments • The Viterbi algorithm • conditional distribution P(α | x, y) reflects model’s uncertainty over the “correct” alignment of x and y • identifies highest probability alignment, αviterbi, in O(L2) time Caveat: the mostlikely alignment is not the mostaccurate • Alternative: find the alignment of maximum expected accuracy P(α) αviterbi P(α | x, y)
4. F 4. T 4. F 4. F 4. F B A- A A- A 4. F 4. F 4. T 4. F 4. F B- B+ B+ B- C The Lazy-Teacher Analogy • 10 students take a 10-question true-false quiz • How do you make the answer key? • Approach #1: Use the answer sheet of the best student! • Approach #2: Weighted majority vote!
Viterbi picks single alignment with highest chance of being completely correct mathematically, finds the alignment α that maximizes Eα*[1{α = α*}] Maximum Expected Accuracy picks alignment with highest expected number of correct predictions mathematically, finds the alignment α that maximizes Eα*[accuracy(α, α*)] 4. T A Viterbi vs. Maximum Expected Accuracy (MEA) 4. F 4. F 4. T 4. F 4. F B A- A A- A 4. F 4. F 4. F 4. F 4. T C B- B+ B+ B-
Computing MEA alignments • Define accuracy (α, α*) = Eα*(accuracy(α, α*) | x, y) ~ Eα*(∑(xi, yj) in α1((xi, yj) in α*) | x,y) = ∑α’P(α’ | x, y) ∑(xi, yj) in α 1((xi, yj) in α’) = ∑(xi, yj) in α ∑α’P(α’ | x, y) 1((xi, yj) in α’) = ∑(xi, yj) in α P(xi, yj in α’ | x, y) • Define M[i, j] = posterior probability that xi is aligned to yj # of correct predicted matches length of shorter sequence
Computing MEA alignments • Define accuracy (α, α*) = • Then, MEA alignment is highest summing path through the matrix M[i, j] = P(xi is aligned to yj | x, y) • M[i, j] = posterior probability that xi is aligned to yj • Can compute with forward, backward dynamic programming in O(L2) time # of correct predicted matches length of shorter sequence
Computing MEA alignments • Define accuracy (α, α*) = • Then, MEA alignment is highest summing path through the matrix M[i, j] = P(xi is aligned to yj | x, y) • M[I, j] = posterior probability that xi is aligned to yj • Can compute with forward, backward dynamic programming in O(L2) time # of correct predicted matches length of shorter sequence
The consistency signal zk z xi x y yj yj’
To estimate P(xiyj | x, y, z) Method 1:triplet-HMM P(xi ~ yj | x, y, z) = ∑k P(xi~yj~zk | x, y, z) Parameters trained with unsupervised EM Running time: O(N3L3) N: # sequences L: sequence lengths
Probabilistic consistency • Compute P(xi is aligned to yj | x, y) P(xi is aligned to yj | x, y, z) • 2 approaches: • 1) Exact – triplet HMM, O(L3) time • 2) Approximate – use independence assumptions ∑k P(xi ~ zk and zk ~ yj | x, y, z) = ∑k P(xi ~ zk | x, z) P(zk ~ yj | x, y, z, xi ~ zk) (assume indep.) ∑k P(xi ~ zk | x, z) P(zk ~ yj | z, y)
Probabilistic consistency • Compute P(xi is aligned to yj | x, y, z) To compute P(xi ~ yj | x, y, z) ~ ∑k P(xi ~ zk | x, z) P(zk ~ yj | z, y) Notice that for any given i, most entries k and j will be close to 0 -- sparse matrices Pxy|z PxzPzy Finally, let Pxy|S 1/|S|∑z in S PxzPzy
ABRACA-DABRA AB-ACARDI--- ABRA---DABI- ABRACA-DABRA AB-ACARDI--- ABRACADABRA ABRA--DABI- AB-ACARDI--- ABRA---DABI- Multiple sequence alignment • A straightforward generalization • sum-of-pairs • tree-based progressive alignment • iterative refinement
ABRACA-DABRA AB-ACARDI--- ABRA---DABI- ABRACA-DABRA AB-ACARDI--- ABRA---DABI- ABRACADABRA ABRA--DABI- ABRACA-DABRA AB-ACARDI--- ABRA---DABI- ABRACA-DABRA AB-ACARDI--- ABRACADABRA ABRA--DABI- AB-ACARDI--- ABRA---DABI- ABRACA-DABRA AB-ACARD--I- ABRA---DABI- ABRACA-DABRA AB-ACARDI--- ABACARDI ABRACADABRA ABACARDI ABRADABI Multiple sequence alignment • A straightforward generalization • sum-of-pairs • tree-based progressive alignment • iterative refinement
Summary of PROBCONS Algorithm • Given K sequences to be aligned, • Compute M[i, j] for all pairs of sequences, x and y • (2) Use probabilistic consistency to reestimate M[i, j] • Build a tree of the sequences by connecting closest first • “Closest” defined according to expected accuracy • EA(x, y) = E(accuracy) of MEA alignment of x and y • Perform progressive alignment along the tree • Score of a column: sum-of-pairs M[i, j] • Apply iterative refinement
Training/testing methodology • 3 reference benchmark sets • PROBCONS parameters trained via unsupervised EM on unaligned sequences from BAliBASE. • Quality score: Q(α, α*) = BAliBASE PREFAB SABmark # of correct predicted matches total # of true matches
Evaluation of Algorithm Components all-pairs pairwise multiple
Resources for alignment Protein Multiple Aligners http://www.ebi.ac.uk/clustalw/ CLUSTALW – most widely used (1994) http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py MUSCLE – most scalable (2004) http://probcons.stanford.edu/ PROBCONS – most accurate (2004) Some more protein multiple aligners: MULTALIGN, MSA, DIALIGN, DCA, MACAW, TCOFFEE, MAFFT, DSC, MUSEQUAL, TOPLIGN, SACHMO, MATCHBOX, PRRN, SAM, MAXHOM, STRAP, ALIGN, AMAS, PILEUP, etc……. ProbCons: Chuong (Tom) Do
PFAM Protein FAMilies database of alignments • Profile HMMs describe each family • For each family in Pfam you can: • Look at multiple alignments • View protein domain architectures • Examine species distribution • Follow links to other databases • View known protein structures
PFAM Pfam-A – curated multiple alignments • Grows slowly; quality controlled by experts Pfam-B – automatic clustering (ProDom derived) • New sequences instantly incorporated; unchecked • Search by: Sequence, keyword, domain, taxonomy • Browsing by family or genome • Evolutionary tree • Source of seed alignments: • Pfam-B families • Published articles • ‘Domain hunting' studies
Profile HMMs • Each M state has a position-specific pre-computed substitution table • Each I state has position-specific gap penalties (and in principle can have its own emission distributions) • Each D state also has position-specific gap penalties • In principle, D-D transitions can also be customized per position Dm-1 Dm D1 D2 BEGIN END I0 I1 Im-1 Im M1 M2 Mm Protein Family F
Profile HMMs • transition between match states – αM(i)M(i+1) • transitions between match and insert states – αM(i)I(i), αI(i)M(i+1) • transition within insert state – αI(i)I(i) • transition between match and delete states – αM(i)D(i+1), αD(i)M(i+1) • transition within delete state – αD(i)D(i+1) • emission of amino acid b at a state S – εS(b) Dm-1 Dm D1 D2 BEGIN END I0 I1 Im-1 Im M1 M2 Mm Protein Family F
Profile HMMs • transition probabilities ~ frequency of a transition in alignment • emission probabilities ~ frequency of an emission in alignment • pseudocounts are usually introduced Dm-1 Dm D1 D2 BEGIN END I0 I1 Im-1 Im M1 M2 Mm Protein Family F
Alignment of a protein to a profile HMM To align sequence x1…xn to a profile HMM: We will find the most likely alignment with the Viterbi DP algorithm • Define • VjM(i): score of best alignment of x1…xi to the HMM ending in xi being emitted from Mj • VjI(i): score of best alignment of x1…xi to the HMM ending in xi being emitted from Ij • VjD(i): score of best alignment of x1…xi to the HMM ending in Dj (xi is the last character emitted before Dj) • Denote by qa the frequency of amino acid a in a ‘random’ protein
Alignment of a protein to a profile HMM Vj-1M(i – 1) + log αM(j-1)M(j) • VjM(i) = log (εM(j)(xi) / qxi) + max Vj-1I(i – 1) + log αI(j-1)M(j) Vj-1D(i – 1) + log αD(j-1)M(j) VjM(i – 1) + log αM(j)I(j) • VjI(i) = log (εI(j)(xi) / qxi) + max VjI(i – 1) + log αI(j)I(j) VjD(i – 1) + log αD(j)I(j) Vj-1M(i) + log αM(j-1)D(j) • VjD(i) = max Vj-1I(i) + log αI(j-1)D(j) Vj-1D(i) + log αD(j-1)D(j)
Weight of each sequence i h g f • One simple weighting scheme is to find how much edge length each leaf contributes • Example: edge 1 belongs to a • Example: edge 3 belongs both to a, and to b: e3e1/(e1+e2) goes to a Δwi = ecurrent wi / (leaves k below ecurrentwk) e d c 2 b 3 1 a
Resources on the web • HMMer – a free profile HMM software • http://hmmer.wustl.edu/ • SAM – another free profile HMM software • http://www.cse.ucsc.edu/research/compbio/sam.html • PFAM – database of alignments and HMMs for protein families and domains • http://www.sanger.ac.uk/Software/Pfam/ • SCOP – a structural classification of proteins • http://scop.berkeley.edu/data/scop.b.html