
CSE182-L5: Scoring matrices Dictionary Matching






Presentation Transcript


  1. CSE182-L5: Scoring matrices Dictionary Matching CSE 182

  2. Silly Quiz

  3. PAM 1 distance
  • Two sequences are 1 PAM apart if they differ in 1% of their residues (1% mismatch).
  • PAM1(a,b) = Pr[residue a substitutes residue b, when the sequences are 1 PAM apart]

  4. PAM 1

  5. Generating Higher PAMs
  • PAM2(a,b) = ∑c PAM1(a,c) · PAM1(c,b)
  • PAM2 = PAM1 × PAM1 (matrix multiplication)
  • PAM250 = PAM1 × PAM249 = PAM1^250
  [Figure: matrix product PAM1 × PAM1 = PAM2, entry (a,b) from row a and column b]
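The matrix-power relation above can be checked numerically. A minimal sketch with NumPy, using a made-up 3×3 row-stochastic matrix in place of the real 20×20 PAM1:

```python
import numpy as np

# Toy 3-residue substitution matrix standing in for PAM1 (real PAM
# matrices are 20x20); each row sums to 1, as for any transition matrix.
pam1 = np.array([
    [0.990, 0.005, 0.005],
    [0.010, 0.980, 0.010],
    [0.005, 0.005, 0.990],
])

# PAM2 = PAM1 * PAM1; PAM250 = PAM1^250 (repeated matrix multiplication).
pam2 = pam1 @ pam1
pam250 = np.linalg.matrix_power(pam1, 250)

# Rows still sum to 1 after any number of multiplications, and
# off-diagonal (substitution) probabilities grow with PAM distance.
assert np.allclose(pam2.sum(axis=1), 1.0)
assert np.allclose(pam250.sum(axis=1), 1.0)
assert pam250[0, 1] > pam1[0, 1]
```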

  6. Scoring residues
  • A reasonable score function C(a,b) is given as follows:
  • Look at 'high quality' alignments.
  • C(a,b) should be high when a,b are seen together more often than is expected by chance.
  • C(a,b) should be low otherwise.
  • How often would you expect to see a,b together just by chance? Pa·Pb.
  • Let Pab be the probability that a and b are aligned in a high-quality alignment.
  • A good scoring function is the log-odds score: C(a,b) = log10(Pab/(Pa·Pb))
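The log-odds score can be sketched directly from the definition; the frequencies below are hypothetical, chosen only to illustrate the sign behaviour:

```python
from math import log10

def log_odds(p_ab, p_a, p_b):
    """C(a,b) = log10(Pab / (Pa*Pb)): positive when a,b are aligned
    more often than expected by chance (Pa*Pb), negative otherwise."""
    return log10(p_ab / (p_a * p_b))

# Hypothetical frequencies: residues a and b each occur 10% of the time,
# so by chance they would be aligned with probability 0.1 * 0.1 = 0.01.
assert log_odds(0.02, 0.1, 0.1) > 0    # seen twice as often as chance
assert log_odds(0.005, 0.1, 0.1) < 0   # seen half as often as chance
```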

  7. Scoring alignments
  • To compute Pab, we need 'high-quality' alignments.
  • How can you get quality alignments?
  • Use SW (but that needs the scoring function).
  • Build alignments manually.
  • Use Dayhoff's theory to extrapolate from high-identity alignments.

  8. Scoring using PAM matrices
  • Suppose we know that two sequences are 250 PAMs apart.
  • S(a,b) = log10(Pab/(Pa·Pb)) = log10(Pa|b/Pa) = log10(PAM250(a,b)/Pa)
  • How does it help?
  • S250(A,V) >> S1(A,V)
  • Scoring of hum vs. dros should use a higher PAM matrix than scoring hum vs. mus.
  • An alignment with a smaller % identity could still have a higher score and be more significant.
  [Figure: phylogenetic tree relating hum, mus, dros]

  9. PAM250-based scoring matrix
  • S250(a,b) = log10(Pab/(Pa·Pb)) = log10(PAM250(a,b)/Pa)

  10. BLOSUM series of Matrices
  • Henikoff & Henikoff: sequence substitutions in evolutionarily distant proteins do not seem to follow the PAM distributions.
  • A more direct method based on hand-curated multiple alignments of distantly related proteins from the BLOCKS database.
  • BLOSUM60: merge all proteins that have greater than 60% identity, then compute the substitution probabilities.
  • In practice, BLOSUM62 seems to work very well.

  11. PAM vs. BLOSUM
  • What is the correspondence?
  • PAM1 ↔ Blosum1? PAM2 ↔ Blosum2? … PAM250 ↔ Blosum100?
  • The numbering runs in opposite directions: a higher PAM number means greater evolutionary distance, while a higher BLOSUM number means the blocks used were more closely related (e.g., Blosum62).

  12. P-value computation
  • We use text filtering to filter the database quickly.
  • The matching regions are expanded into alignments, which are scored using SW and an appropriate scoring matrix.
  • The results are presented in order.
  • How significant is the top-scoring hit if it has a score S?
  • Expect/E-value (score S) = number of times we would expect a random query to generate a score S or better.
  • How can we compute the E-value?

  13. What is a distribution function?
  • Given a collection of numbers (scores): 1, 2, 8, 3, 5, 3, 6, 4, 4, 1, 5, 3, 6, 7, …
  • Plot its distribution as follows:
  • X-axis = each number
  • Y-axis = count/frequency/probability of seeing that number
  • More generally, the x-axis can be a range, to accommodate real numbers.

  14. P-value computation
  • How significant is a score? What happens to significance when you change the score function?
  • A simple empirical method:
  • Compute a distribution of scores against a random database.
  • Use an estimate of the area under the curve to get the probability.
  • OR, fit the distribution to one of the standard distributions.
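The empirical method above can be sketched as follows; the null scores here are simulated with a made-up Gaussian, standing in for scores of the query against shuffled databases:

```python
import random

def empirical_pvalue(score, random_scores):
    """Area under the empirical distribution at or beyond `score`:
    the fraction of random-database scores that reach it."""
    hits = sum(1 for s in random_scores if s >= score)
    return hits / len(random_scores)

# Hypothetical null distribution: scores against random databases.
random.seed(0)
null = [random.gauss(30, 5) for _ in range(10_000)]

# A score of 50 is four standard deviations above the null mean,
# so almost no random score reaches it.
p = empirical_pvalue(50, null)
```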

  15. Z-scores for alignment
  • Initial assumption was that the scores followed a normal distribution.
  • Z-score computation: for an alignment with score S, shuffle one of the sequences many times, and recompute the alignment each time. Get the mean and standard deviation of the shuffled scores.
  • Look up a table to get a P-value.
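The shuffling procedure can be sketched as follows; `toy_score` is a hypothetical stand-in for a real Smith-Waterman scorer:

```python
import random
from statistics import mean, stdev

# Toy stand-in for a Smith-Waterman scorer: count of identical positions.
def toy_score(a, b):
    return sum(x == y for x, y in zip(a, b))

def z_score(score, seq_a, seq_b, align_score, trials=100):
    """Shuffle seq_b many times, realign against seq_a, and report how
    many standard deviations `score` sits above the shuffled mean."""
    shuffled = list(seq_b)
    scores = []
    for _ in range(trials):
        random.shuffle(shuffled)
        scores.append(align_score(seq_a, "".join(shuffled)))
    return (score - mean(scores)) / stdev(scores)

random.seed(0)
a = b = "ACGTACGT"
# A perfect self-match should sit far above the mean of shuffled scores.
z = z_score(toy_score(a, b), a, b, toy_score)
```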

  16. Blast E-value
  • The initial (and natural) assumption was that scores followed a normal distribution.
  • In 1990, Karlin and Altschul showed that ungapped local alignment scores follow an exponential distribution.
  • Practical consequence: a longer tail, so previously significant hits are now not so significant.

  17. Altschul-Karlin statistics
  • For simplicity, assume that the database is a binary string, and so is the query.
  • Let match score = 1, mismatch score = −∞, indel score = −∞ (no gaps allowed).
  • What does it mean to get a score k?

  18. Exponential distribution
  • Random database, Pr(1) = p.
  • What is the expected number of hits to a sequence of k 1's?
  • Instead, consider a random binary matrix. What is the expected number of diagonals of k 1s?

  19. As you increase k, the number decreases exponentially.
  • The number of diagonals of k runs can be approximated by a Poisson process.
  • In ungapped alignments, we replace the coin tosses by column scores, but the behaviour does not change (Karlin & Altschul).
  • As the score increases, the number of alignments that achieve the score decreases exponentially.
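The exponential fall-off can be seen directly from the expectation for runs in a random binary string; this small sketch uses (n − k + 1)·p^k, the expected number of starting positions of k consecutive 1s:

```python
# Expected number of starting positions of a run of k 1s in a random
# binary string of length n with Pr(1) = p: about (n - k + 1) * p**k.
# (For diagonals of a binary matrix the prefactor changes, but the p**k
# factor, and hence the exponential decay in k, is the same.)
def expected_runs(n, k, p=0.5):
    return (n - k + 1) * p ** k

counts = [expected_runs(10**6, k) for k in range(10, 31, 5)]
# Each step of +5 in k divides the expectation by roughly 2**5 = 32.
```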

  20. Blast E-value
  • Choose a score function such that the expected score between a pair of random residues is < 0.
  • E-value = expected number of alignments with a score ≥ S.
  • For small values, the E-value and P-value are nearly the same.
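For ungapped local alignment the expected count takes the Karlin-Altschul form E = K·m·n·e^(−λS). A sketch with placeholder parameter values (K and λ depend on the scoring matrix and residue frequencies; the numbers below are illustrative only):

```python
from math import exp

def evalue(S, m, n, K, lam):
    """Karlin-Altschul expected number of ungapped local alignments
    scoring >= S for a query of length m against a database of length n:
    E = K * m * n * exp(-lambda * S)."""
    return K * m * n * exp(-lam * S)

# Placeholder parameters: in BLAST, K and lambda are computed from the
# scoring matrix and the residue frequencies.
E = evalue(S=100, m=300, n=10**7, K=0.1, lam=0.3)

# For small E, the P-value 1 - exp(-E) is approximately E itself.
```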

  21. The last step in Blast
  • We have discussed: alignments, db filtering using keywords, scoring matrices, E-values and P-values.
  • The last step: database filtering requires us to scan a large sequence fast for matching keywords.

  22. Dictionary Matching, R.E. matching, and position-specific scoring

  23. Keyword search
  • Recall: in BLAST, we get a collection of keywords from the query sequence, and identify all db locations with an exact match to a keyword.
  • Question: given a collection of strings (keywords), find all occurrences in a database string where a keyword matches.

  24. Dictionary Matching
  • Dictionary: 1:POTATO 2:POTASSIUM 3:TASTE
  • Database: POTASTPOTATO
  • Q: Given k words (word si has length li), and a database of size n, find all matches to these words in the database string.
  • How fast can this be done?

  25. Dictionary matching & string matching
  • How fast can you do it if you only have one word, of length m?
  • Trivial algorithm: O(nm) time.
  • With pre-processing: O(m) pre-processing, O(n) search time.
  • Dictionary matching:
  • Trivial algorithm: O((l1+l2+l3+…)·n).
  • Using a keyword tree: O(lp·n), where lp is the length of the longest pattern.
  • Aho-Corasick: O(n) search after O(l1+l2+…) preprocessing.
  • We will consider the most general case.
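The trivial O((l1+l2+…)·n) algorithm is worth seeing concretely; a sketch on the slide's own dictionary and database:

```python
def naive_dictionary_match(patterns, text):
    """Trivial O((l1+l2+...+lk) * n) scan: try every pattern
    at every starting position of the database."""
    hits = []
    for i in range(len(text)):
        for p in patterns:
            if text.startswith(p, i):
                hits.append((i, p))
    return hits

dictionary = ["POTATO", "POTASSIUM", "TASTE"]
database = "POTASTPOTATO"
# Only POTATO occurs, starting at position 6.
matches = naive_dictionary_match(dictionary, database)
```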

  26. Direct Algorithm
  [Figure: each pattern slid position by position along the database POTASTPOTATO]
  Observations:
  • When we mismatch, we (should) know something about where the next match will be.
  • When there is a mismatch, we (should) know something about other patterns in the dictionary as well.

  27. The Trie Automaton
  • Construct an automaton A from the dictionary (1:POTATO 2:POTASSIUM 3:TASTE).
  • A[v,x] describes the transition from node v to a node w upon reading x.
  • E.g., A[u,'T'] = v, and A[u,'S'] = w.
  • Special root node r.
  • Some nodes are terminal, and labeled with the index of the dictionary word.
  [Figure: trie over POTATO, POTASSIUM, TASTE, with root r and nodes u, v, w]
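A minimal sketch of the trie construction, with the transition function A stored as nested dicts (the `next`/`out` field names are illustrative, not from the slides):

```python
def build_trie(patterns):
    """Build the automaton A as nested dicts: node['next'][x] is the
    transition A[v,x]; a terminal node stores the word's index in 'out'."""
    root = {"next": {}, "out": None}      # special root node r
    for idx, pat in enumerate(patterns, start=1):
        v = root
        for ch in pat:
            v = v["next"].setdefault(ch, {"next": {}, "out": None})
        v["out"] = idx                    # terminal node labeled with index
    return root

trie = build_trie(["POTATO", "POTASSIUM", "TASTE"])

# 'POTA' is a shared prefix of patterns 1 and 2, so it is a single path
# that then branches on 'T' (POTAT...) and 'S' (POTAS...).
node = trie
for ch in "POTA":
    node = node["next"][ch]
```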

  28. An O(lp·n) algorithm for keyword matching
  Start with the first position in the db, and the root node.
  If successful transition:
    Increment 'current' pointer
    Move to the new node
    If terminal node: report "success"
  Else:
    Retract 'current' pointer
    Increment 'start' pointer
    Move to root & repeat

  29. Illustration
  [Figure: matching the database POTASTPOTATO against the trie, with 'start' pointer l and 'current' pointer c]

  30. Idea for improving the time
  • Suppose we have partially matched pattern i (between pointers l and c), but fail subsequently. If some other pattern j is to match,
  • then a prefix of pattern j must equal a suffix of the first c−l matched characters of pattern i.
  [Figure: POTASSIUM (pattern i) failing against database POTASTPOTATO, with TASTE (pattern j) starting inside the matched region. Dictionary: 1:POTATO 2:POTASSIUM 3:TASTE]

  31. Improving the speed of dictionary matching
  • Every node v corresponds to a string sv that is a prefix of some pattern.
  • Define F[v] to be the node u such that su is the longest proper suffix of sv that is also a node of the trie.
  • If we fail to match at v, we should jump to F[v], and commence matching from there.
  • Let lp[v] = |su|.
  [Figure: trie over POTATO, POTASSIUM, TASTE with failure links between nodes]
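Combining the trie with these failure links gives the Aho-Corasick algorithm. A compact sketch, assuming a list-of-dicts representation (node 0 is the root; `fail` plays the role of F[v]):

```python
from collections import deque

def aho_corasick(patterns, text):
    """Trie + failure links F[v]: O(n) scan of the database after
    O(l1 + l2 + ...) preprocessing of the dictionary."""
    nxt, fail, out = [{}], [0], [[]]          # node 0 is the root r
    for pat in patterns:                      # build the trie (goto)
        v = 0
        for ch in pat:
            if ch not in nxt[v]:
                nxt.append({}); fail.append(0); out.append([])
                nxt[v][ch] = len(nxt) - 1
            v = nxt[v][ch]
        out[v].append(pat)                    # terminal node: record word
    q = deque(nxt[0].values())                # BFS to fill failure links
    while q:
        v = q.popleft()
        for ch, w in nxt[v].items():
            q.append(w)
            f = fail[v]
            while f and ch not in nxt[f]:
                f = fail[f]                   # follow F until ch matches
            fail[w] = nxt[f].get(ch, 0)
            out[w] += out[fail[w]]            # inherit words ending here
    hits, v = [], 0                           # single left-to-right scan
    for i, ch in enumerate(text):
        while v and ch not in nxt[v]:
            v = fail[v]                       # jump to F[v] on mismatch
        v = nxt[v].get(ch, 0)
        for pat in out[v]:
            hits.append((i - len(pat) + 1, pat))
    return hits

matches = aho_corasick(["POTATO", "POTASSIUM", "TASTE"], "POTASTPOTATO")
```

Note that unlike the O(lp·n) restart algorithm, the 'start' pointer never moves backwards: the failure links carry over everything already matched.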

  32. End of L5
