
Seeds for Similarity Search

Seeds for Similarity Search. Presentation by: Anastasia Fedynak. Homology Search: homology search consumes 10% of the world’s supercomputing time; the NCBI BLAST server processes 10^5 queries/day; GenBank doubles in size every 18 months; completed genomes: human, mouse, rice, fly, etc.





Presentation Transcript


  1. Seeds for Similarity Search Presentation by: Anastasia Fedynak

  2. Homology Search • Homology search consumes 10% of the world’s supercomputing time • NCBI BLAST server processes 10^5 queries/day • GenBank doubles in size every 18 months • Completed genomes: human, mouse, rice, fly, etc. • Software must be scalable for large datasets

  3. Homology Search Tools • Identify short seed matches (k consecutive bases) between DNA sequences, which are then extended • BLAST, FASTA: too slow, and miss many alignments • Smith-Waterman DP: too slow • MegaBlast: high speed, works well for highly similar sequences

  4. Discontiguous Seeds • Require matching pairs of bases at a subset of positions • Califano and Rigoutsos (1993): random discontiguous pattern in FLASH • Buhler (2001): sensitivity of random patterns in the LSH-ALL-PAIRS comparison algorithm • Blastz, underlying the PipMaker program (2000) • PatternHunter (Ma, Tromp, and Li, 2002)

  5. Resource-constrained paradigm of seed design • Given a collection of ungapped genomic sequence similarities of fixed length l, modeled by a kth-order Markov model M, find n seeds π1 … πn such that the probability of detecting a similarity is maximized

  6. Problem Definition • Let C be a collection of genomic sequence similarities of l bases, each written as a bit string: 1 = match, 0 = mismatch • A detected similarity is the starting point for gapped extension

  7. Problem Definition • Similarity is modeled by a kth-order Markov process M • M gives the probability that the next bit seen will be a 1 (match) • Coding regions exhibit the repeating pattern {1, 1, 0}: protein-coding sequence with silent mutations at the 3rd base position of a codon

  8. Problem Definition • Devise a seed π, an ordered list of w positions {x1…xw}, with weight w and span s • Ex. π = {1,3,4,6,7}, w = 5, s = 7 • π detects S iff there is an offset j such that S[j + xi] = 1 for all 1 ≤ i ≤ w, i.e. with π placed at offset j, S contains a matching base at every position of π • S = 1011011: match • S = 1001011: mismatch
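The detection rule on this slide can be checked mechanically. The following is a minimal sketch, assuming 1-based seed positions and a similarity given as a string of '1'/'0' characters; the function name `detects` is illustrative and not from the presentation.

```python
def detects(seed, s_bits):
    """Return True if the seed hits the similarity at some offset.

    seed   -- 1-based positions inspected by the seed, e.g. [1, 3, 4, 6, 7]
    s_bits -- similarity as a string of '1' (match) and '0' (mismatch)
    """
    span = max(seed)
    # Slide the seed over every offset at which it still fits inside S.
    for j in range(len(s_bits) - span + 1):
        if all(s_bits[j + x - 1] == "1" for x in seed):
            return True
    return False


seed = [1, 3, 4, 6, 7]
print(detects(seed, "1011011"))  # True  (the slide's "match" example)
print(detects(seed, "1001011"))  # False (position 3 is a mismatch)
```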

  9. Problem Definition • Find a seed π that maximizes sensitivity to S drawn from model M, i.e. maximize the detection probability Pr_{S~M}[π detects S]

  10. Selecting Good Seeds • Seed length is determined by a tradeoff between speed and sensitivity • Larger k: faster, less sensitive • Smaller k: slower, more sensitive • BLAST uses k consecutive letters as seeds • k = 11 in Blastn and k = 28 in MegaBlast

  11. Selecting Good Seeds • INDEPENDENCE: the probabilities of matches at different offsets are not independent • Generally, the fewer bases a seed shares with its shifted copies, the higher its sensitivity • Consecutive models → low sensitivity

  12. PatternHunter • Optimal model via DP : 111010010100110111 • w = 11, s = 18 • shifted copy shares 5 bases

  13. PatternHunter • Optimal model via DP : 111010010100110111 111010010100110111 • w = 11, s = 18 • shifted copy shares 5 bases

  14. Spaced vs. Consecutive Seeds • LEMMA: The expected number of hits of a seed with weight w and span s within a length-l region of similarity p (0 ≤ p ≤ 1) is (l - s + 1)·p^w • Example: In a region of length 64 and similarity 0.7
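The slide's example is cut off in this transcript, so the sketch below only evaluates the lemma for two seeds that are assumptions on my part: a consecutive seed of weight 11 (the Blastn value from slide 10) and the PatternHunter seed of weight 11 and span 18 (slide 12). The helper name `expected_hits` is illustrative.

```python
def expected_hits(length, span, weight, p):
    """Expected number of seed hits: (l - s + 1) * p**w."""
    return (length - span + 1) * p ** weight


# Region of length 64 with similarity 0.7, as in the slide's example.
consecutive = expected_hits(64, span=11, weight=11, p=0.7)
spaced = expected_hits(64, span=18, weight=11, p=0.7)

print(round(consecutive, 3))  # 1.068
print(round(spaced, 3))       # 0.929
```

The spaced seed has slightly fewer expected hits because fewer offsets fit in the region. The expected count says nothing about how hits cluster, which is what the independence argument on slide 11 addresses.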

  15. Quality Comparison

  16. Quality Comparison

  17. Performance Comparison

  18. Mandala – Seed Selection • Let π = {x1…xw} be the current seed • Define the local neighbourhood of π as the set of all seeds π’ that differ from π in one position • Hill climbing with random restarts finds a near-optimal seed • Seeds are evaluated by computing their detection probability
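A rough sketch of hill climbing with random restarts over this neighbourhood follows. It is not Mandala's implementation. It assumes a simplified model in which every bit is an independent match with probability p (a 0th-order model), keeps the first and last seed positions fixed so the span stays constant, and takes the best neighbour at each step. All function names are hypothetical.

```python
import random


def sensitivity(seed, length, p):
    """Pr[seed hits] for `length` i.i.d. bits with Pr[1] = p (assumed model)."""
    top = max(seed)
    mask = sum(1 << (top - x) for x in seed)   # newest bit is bit 0
    keep = (1 << (top - min(seed))) - 1        # remember the last span-1 bits
    alive = {0: 1.0}                           # state -> Pr[no hit so far]
    for _ in range(length):
        nxt = {}
        for state, prob in alive.items():
            for bit, pb in ((1, p), (0, 1.0 - p)):
                window = (state << 1) | bit
                if window & mask != mask:      # no hit: keep this mass
                    key = window & keep
                    nxt[key] = nxt.get(key, 0.0) + prob * pb
        alive = nxt
    return 1.0 - sum(alive.values())


def neighbours(seed, span):
    """Seeds that differ from `seed` in one inner position (ends stay fixed)."""
    for old in sorted(seed - {1, span}):
        for new in range(2, span):
            if new not in seed:
                yield (seed - {old}) | {new}


def hill_climb(weight, span, length, p, restarts=5, rng_seed=0):
    rng = random.Random(rng_seed)
    best, best_score = None, -1.0
    for _ in range(restarts):
        # Random start with the required weight and span.
        seed = frozenset([1, span] + rng.sample(range(2, span), weight - 2))
        score = sensitivity(seed, length, p)
        while True:
            scored = [(sensitivity(c, length, p), sorted(c))
                      for c in neighbours(seed, span)]
            if not scored:
                break
            cand_score, cand = max(scored)
            if cand_score <= score + 1e-12:    # local optimum reached
                break
            seed, score = frozenset(cand), cand_score
        if score > best_score:
            best, best_score = seed, score
    return sorted(best), best_score


seed, score = hill_climb(weight=4, span=7, length=20, p=0.7)
print(seed, round(score, 4))
```

The small weight and span keep the example fast: the evaluator tracks up to 2^(s-1) states, so realistic spans call for the automaton-based computation described later in the presentation.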

  19. Detection Probabilities • The overlap structure of a seed is encoded in a DFA • DP computes the probability that the DFA accepts a random similarity of length l drawn from the kth-order Markov model M • P(q, t, δ·b) is the probability of reaching state q after reading t bits of an input S, the last k+1 of which are δ·b • For a state q, let Φ_b(q) be the set of all states that transition to q on bit b • Recurrence: $P(q, t, \delta \cdot b) = \Pr(S[t] = b \mid S[t-k \ldots t-1] = \delta) \times \sum_{q' \in \Phi_b(q)} \sum_{b_0 \in \{0,1\}} P(q', t-1, b_0 \cdot \delta)$
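To make the dynamic program concrete, here is a deliberately simplified sketch. It assumes a 0th-order model (each bit is an independent match with probability p), so the context δ drops out, and it uses the last s-1 bits read as the automaton state instead of the states of the seed's DFA. The name `detection_probability` is illustrative.

```python
def detection_probability(seed, length, p):
    """Pr[seed detects S] for S of `length` i.i.d. bits with Pr[1] = p.

    seed -- 1-based seed positions, e.g. [1, 3, 4, 6, 7]
    """
    top = max(seed)
    span = top - min(seed) + 1
    # Bit 0 of a window is the newest bit; bit span-1 is the oldest.
    mask = sum(1 << (top - x) for x in seed)
    keep = (1 << (span - 1)) - 1
    # alive[state] = probability of being in `state` with no hit so far.
    # Starting from all zeros is safe: any window that still contains
    # padding has a 0 at the seed's first position, so it cannot hit.
    alive = {0: 1.0}
    for _ in range(length):
        nxt = {}
        for state, prob in alive.items():
            for bit, pb in ((1, p), (0, 1.0 - p)):
                window = (state << 1) | bit
                if window & mask == mask:
                    continue  # hit: this mass leaves the "no hit" set
                key = window & keep
                nxt[key] = nxt.get(key, 0.0) + prob * pb
        alive = nxt
    return 1.0 - sum(alive.values())


# With length == span only one offset fits, so the result is p**w.
print(round(detection_probability([1, 3, 4, 6, 7], 7, 0.7), 5))  # 0.16807
print(round(detection_probability([1, 3, 4, 6, 7], 64, 0.7), 4))
```

Extending this to a kth-order model means widening the remembered context to at least k bits and replacing the constant p with the conditional probability from M, as in the recurrence on the slide.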

  20. Performance Comparison – Non Coding DNA Sequence

  21. Performance Comparison – Coding DNA Sequence

  22. Influence of Model Order • M5 model (solid line): exploits nearest-neighbor • Mc5 model (dashed line): exploits correlation arising from codon structure

  23. Multi-Seed Design – Why? • Seed matching heuristics optimize a tradeoff between sensitivity (true +ve rate) and specificity (1 - false +ve rate) • True +ve: an alignment contains a seed match • False +ve: a seed match occurs by chance (probability ~ 1/4^w per base) • Increasing w reduces π’s false +ve rate • But it also lowers sensitivity

  24. Multi-Seed Design – Why? • Multiple seeds provide a more attractive way to trade sensitivity for specificity • Use a set Π of seeds with weight w’ > w • The expected rate of chance matches is |Π|/4^w’
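A quick arithmetic check of this tradeoff, assuming uniform base composition (each of the four bases equally likely) and taking w = 11 as an example value: four seeds of weight 12 have the same expected chance-match rate as one seed of weight 11. The helper name is illustrative.

```python
def chance_match_rate(num_seeds, weight):
    """Expected chance matches per position: |seed set| / 4**weight."""
    return num_seeds / 4 ** weight


single = chance_match_rate(1, 11)     # one seed, weight w = 11
four_seeds = chance_match_rate(4, 12)  # four seeds, weight w' = 12

print(single == four_seeds)  # True: 4 / 4**12 equals 1 / 4**11
```

So a set of up to four weight-12 seeds costs no more false positives than a single weight-11 seed, and the design question becomes whether the set is more sensitive.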

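  25. Problem Definition • The event that a seed π matches alignment α is written $E_\pi(\alpha)$ • The event that it does not match is the complement $\overline{E_\pi(\alpha)}$ • The match probability of π under M is $\Pr_{\alpha \sim M}(E_\pi(\alpha))$ • A set Π matches α if at least one of its seeds matches; this event is written $E_\Pi(\alpha)$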

  26. Problem Definition • Find a set Π of n seeds that maximizes sensitivity to S drawn from model M, i.e. maximize the detection probability Pr_{S~M}[Π detects S]

  27. Algorithms for Multi-Seed Design • Local Approach • Used in Mandala • Greedy Covering • Beam Search

  28. Mandala’s Local Search Algorithm • Given w and s • Begin with a set Π0 of n randomly chosen seeds with common w and s • Choose i and j, where 1 ≤ i ≤ n and 2 ≤ j ≤ w; then find the best seed set Π1 in the neighbourhood of Π0 by deleting position xj of the ith seed πi ∈ Π0 and replacing it with a position between 1 and s-1 not currently inspected by πi • Iterate through i and j until no further improvement is possible

  29. Greedy Heuristic for Computing Seed Sets • Given a partial seed set Π0, choose the next seed π that maximizes the conditional match probability under alignment model M: $\Pr(E_\pi \mid \overline{E_{\Pi_0}})$ • i.e. the seed most likely to match an alignment not already matched by some seed in the current set • Start from a single locally optimal seed

  30. Extension to Beam Search • Initially find a number of locally optimal single seeds • The best b are saved and used in the next optimization round • For each saved seed π0, find N seeds π, each of which locally optimizes $\Pr(E_\pi \mid \overline{E_{\pi_0}})$ • The b seed pairs {π0, π} with the highest match probability over all b·N pairs are again saved • The best seed set overall is chosen

  31. Performance

  32. Computing Conditional Match Probabilities • Construct a DFA A_π that accepts alignments containing a seed match to π • By DP, compute the probability that A_π accepts a random alignment of length l from M • Compute the conditional match probability for seed π and set Π: $\Pr(E_\pi \mid \overline{E_\Pi}) = \dfrac{\Pr(E_{\Pi \cup \pi}) - \Pr(E_\Pi)}{1 - \Pr(E_\Pi)}$
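The identity on this slide needs only unconditional match probabilities of seed sets. Below is a minimal sketch of using it for one greedy step, under the simplifying assumption that alignment bits are independent matches with probability p (a 0th-order model), and with the last s-1 bits as DP state rather than a product of seed DFAs. The function names and candidate seeds are illustrative.

```python
def set_sensitivity(seeds, length, p):
    """Pr[at least one seed in `seeds` hits] for i.i.d. bits with Pr[1] = p."""
    if not seeds:
        return 0.0
    # Each mask is aligned so that a seed's last position is the newest bit.
    masks = [sum(1 << (max(s) - x) for x in s) for s in seeds]
    longest = max(max(s) - min(s) + 1 for s in seeds)
    keep = (1 << (longest - 1)) - 1
    alive = {0: 1.0}  # state -> probability that no seed has hit so far
    for _ in range(length):
        nxt = {}
        for state, prob in alive.items():
            for bit, pb in ((1, p), (0, 1.0 - p)):
                window = (state << 1) | bit
                if any(window & m == m for m in masks):
                    continue  # some seed hits here
                key = window & keep
                nxt[key] = nxt.get(key, 0.0) + prob * pb
        alive = nxt
    return 1.0 - sum(alive.values())


def conditional_match(seed, seed_set, length, p):
    """Pr(E_pi | not E_Pi) = (Pr(E_{Pi u pi}) - Pr(E_Pi)) / (1 - Pr(E_Pi))."""
    base = set_sensitivity(seed_set, length, p)
    joint = set_sensitivity(seed_set + [seed], length, p)
    return (joint - base) / (1.0 - base)


# One greedy step: pick the candidate that adds the most to the current set.
current = [[1, 2, 3, 4]]
candidates = [[1, 2, 3, 4], [1, 2, 4, 5], [1, 3, 4, 6]]
best = max(candidates, key=lambda c: conditional_match(c, current, 20, 0.7))
print(best, round(conditional_match(best, current, 20, 0.7), 4))
```

A candidate already in the set scores exactly zero, which is why the greedy step naturally favours seeds that cover alignments the current set misses.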

  33. Detection Probabilities • Let π be a seed of weight w and span s • Q_π is the set of all s-bit strings matching π • Construct a trie T_π from the strings of Q_π • Convert T_π to a DFA A_π (Aho-Corasick algorithm) • A_π accepts a similarity S iff π detects S
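The construction on this slide can be sketched as follows. This is a textbook Aho-Corasick build over the alphabet {0, 1}, not code from the presentation. It assumes 1-based seed positions with the first position equal to 1, and the function names are illustrative.

```python
from collections import deque
from itertools import product


def seed_patterns(seed):
    """Q_pi: all span-length bit strings with a 1 at every seed position."""
    span = max(seed)
    free = [i for i in range(1, span + 1) if i not in seed]
    patterns = []
    for fill in product("01", repeat=len(free)):
        bits = ["1"] * span
        for pos, b in zip(free, fill):
            bits[pos - 1] = b
        patterns.append("".join(bits))
    return patterns


def build_dfa(patterns):
    """Trie plus failure links, flattened into a full transition table."""
    goto = [{}]      # goto[state][bit] -> next state (trie edges first)
    accept = set()
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        accept.add(state)
    fail = [0] * len(goto)
    queue = deque()
    for ch in "01":
        if ch in goto[0]:
            queue.append(goto[0][ch])
        else:
            goto[0][ch] = 0          # missing root edges loop back to the root
    while queue:                     # breadth-first, so shallower states are done
        state = queue.popleft()
        for ch in "01":
            if ch in goto[state]:
                child = goto[state][ch]
                fail[child] = goto[fail[state]][ch]
                if fail[child] in accept:
                    accept.add(child)
                queue.append(child)
            else:
                goto[state][ch] = goto[fail[state]][ch]
    return goto, accept


def accepts(goto, accept, s_bits):
    """True if the similarity contains a seed match anywhere."""
    state = 0
    for ch in s_bits:
        state = goto[state][ch]
        if state in accept:
            return True
    return False


goto, accept = build_dfa(seed_patterns([1, 3, 4, 6, 7]))
print(accepts(goto, accept, "1011011"))    # True
print(accepts(goto, accept, "1001011"))    # False
print(accepts(goto, accept, "001011011"))  # True (match at offset 2)
```

Q_π has 2^(s-w) strings, so the trie stays small for the example seed (4 patterns) but grows quickly as the span exceeds the weight.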
