Seeds for Similarity Search

Presentation by: Anastasia Fedynak

Homology Search • Homology search consumes 10% of the world’s supercomputing time • NCBI Blast server processes 10^5 queries/day • GenBank doubles in size every 18 months • Completed genomes: human, mouse, rice, fly, etc. • Software must scale to large datasets

Homology Search Tools • Identify short seed matches (k consecutive bases) between DNA sequences, which are then extended • BLAST, FASTA: too slow and miss many alignments • Smith-Waterman DP: too slow • MegaBlast: high speed, works well for highly similar sequences

Discontiguous Seeds • Require matching pairs of bases at only a subset of positions • Califano and Rigoutsos (1993): random discontiguous pattern in FLASH • Buhler (2001): sensitivity of random patterns in the LSH-ALL-PAIRS comparison algorithm • Blastz, underlying the PipMaker program (2000) • PatternHunter (Ma, Tromp, and Li, 2002)

Resource-constrained paradigm of seed design • Given a collection of ungapped genomic sequence similarities of fixed length l, modeled by a kth-order Markov model M, find n seeds π1 … πn such that the probability of detecting a similarity is maximized

Problem Definition • Let C be a collection of ungapped genomic sequence similarities of l bases, each written as a bit string (1 = match, 0 = mismatch) • A detected similarity is the starting point for gapped extension

Problem Definition • Similarity is modeled by a kth-order Markov process M • M gives the probability that the next bit seen will be a 1 (match), given the previous k bits • Coding regions exhibit the pattern {1, 1, 0}: silent mutations at the 3rd base position of a codon leave the protein unchanged

Problem Definition • Devise a seed π, an ordered list of w positions {x1, …, xw}, with weight w and span s • Ex. π = {1,3,4,6,7}: w = 5, s = 7 • π detects S iff there is an offset j such that S[j + xi] = 1 for all 1 ≤ i ≤ w, i.e. with π placed at offset j, S contains a matching base at every position of π • S = 1011011: match • S = 1001011: mismatch (position 3 is 0)
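The detection rule on this slide can be sketched in a few lines of Python. This is an illustrative sketch rather than code from the presentation: the function name `detects` is hypothetical, seed positions are 1-based as on the slide, and the similarity S is assumed to be given as a string of '0' and '1' characters.

```python
def detects(seed, similarity):
    """Return True if the seed matches the similarity at some offset.

    seed       -- iterable of 1-based positions, e.g. (1, 3, 4, 6, 7)
    similarity -- string over {'0', '1'}, where '1' marks a matching base
    """
    span = max(seed)  # assumes the smallest seed position is 1
    for j in range(len(similarity) - span + 1):
        # Every inspected position must be a match at this offset.
        if all(similarity[j + x - 1] == "1" for x in seed):
            return True
    return False


pi = (1, 3, 4, 6, 7)
print(detects(pi, "1011011"))  # slide example: match
print(detects(pi, "1001011"))  # slide example: mismatch
```

Positions outside the seed (2 and 5 here) are "don't care" positions, which is what distinguishes a spaced seed from k consecutive bases.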

Problem Definition • Find a seed π that maximizes sensitivity to S from model M, i.e. maximize the detection probability $\Pr_{S \sim M}[\pi \text{ detects } S]$

Selecting Good Seeds • Seed length is determined by a tradeoff between speed and sensitivity • Large k: fast, low sensitivity • Small k: slow, high sensitivity • Blast uses k consecutive letters as seeds • k = 11 in Blastn and k = 28 in MegaBlast

Selecting Good Seeds • INDEPENDENCE: probabilities of matches at different offsets are not independent • Generally, the fewer bases shared between a seed and its shifted copies, the higher the sensitivity • Consecutive models → low sensitivity

PatternHunter • Optimal model via DP : 111010010100110111 • w = 11, s = 18 • shifted copy shares 5 bases
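The "shifted copy shares 5 bases" figure can be checked by counting the positions that a seed and its shifted copy both inspect. The helper below is a hypothetical illustration (the name `shared_positions` is not from the presentation), and the consecutive weight-11 seed is included only for comparison.

```python
def shared_positions(pattern, shift):
    """Count positions inspected by both a seed and its copy shifted by `shift`.

    pattern -- seed written as a 0/1 string, '1' meaning "must match"
    """
    ones = {i for i, c in enumerate(pattern) if c == "1"}
    shifted = {i + shift for i in ones}
    return len(ones & shifted)


spaced = "111010010100110111"   # PatternHunter seed, w = 11, s = 18
consecutive = "1" * 11          # consecutive seed of the same weight

print(shared_positions(spaced, 1))
print(shared_positions(consecutive, 1))
```

For a shift of one position the spaced seed shares 5 positions with its copy, while the consecutive seed shares 10, so hits of the consecutive seed at neighbouring offsets are far more strongly correlated.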

Spaced vs. Consecutive Seeds • LEMMA: the expected number of hits of a seed with weight w and span s within a length-l region of similarity p (0 ≤ p ≤ 1) is $(l - s + 1)\,p^w$ • Example: a region of length 64 and similarity 0.7
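Evaluating the lemma for the example region (l = 64, p = 0.7) takes one line of arithmetic. The sketch below assumes the two seeds being compared are a consecutive weight-11 seed (span 11, as in Blastn) and the weight-11, span-18 PatternHunter seed; that pairing is an assumption, since the slide's own numbers did not survive. The function name `expected_hits` is hypothetical.

```python
def expected_hits(length, span, weight, p):
    """Expected number of seed hits in a region, per the lemma: (l - s + 1) * p**w."""
    return (length - span + 1) * p ** weight


consecutive = expected_hits(length=64, span=11, weight=11, p=0.7)
spaced = expected_hits(length=64, span=18, weight=11, p=0.7)

print(f"consecutive seed: {consecutive:.2f} expected hits")  # about 1.07
print(f"spaced seed:      {spaced:.2f} expected hits")       # about 0.93
```

The spaced seed expects slightly fewer hits because fewer offsets fit in the region. Its sensitivity can still be higher, since its hits overlap less and are therefore less redundant.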

Quality Comparison

Performance Comparison

Mandala – Seed Selection • Let π = {x1…xw} be the current seed • Define the local neighbourhood of π as the set of all seeds π’ that differ from π in one position • Hill climbing with random restarts finds a near-optimal seed • Each candidate is evaluated by computing its detection probability
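A minimal sketch of hill climbing with random restarts over this neighbourhood is shown below. It is not Mandala's implementation. It assumes an i.i.d. model (each bit is a match with probability p) in place of the kth-order Markov model, keeps the first and last seed positions fixed so that the span stays s, and uses hypothetical names throughout (`sensitivity`, `neighbours`, `hill_climb`).

```python
import random
from collections import defaultdict


def sensitivity(seed, length, p):
    """Pr[seed detects a random similarity]; bits i.i.d. with Pr[1] = p (assumed model)."""
    span = max(seed)
    mask = sum(1 << (span - x) for x in seed)  # newest bit is the lowest bit
    keep = (1 << (span - 1)) - 1               # remember the last span-1 bits
    state = {0: 1.0}                           # window -> Pr[no hit so far]
    for t in range(1, length + 1):
        nxt = defaultdict(float)
        for window, prob in state.items():
            for bit, pb in ((1, p), (0, 1.0 - p)):
                full = (window << 1) | bit
                if t >= span and full & mask == mask:
                    continue                   # seed hit: leaves the no-hit mass
                nxt[full & keep] += prob * pb
        state = nxt
    return 1.0 - sum(state.values())


def neighbours(seed, span):
    """Seeds differing from `seed` in one position (endpoints 1 and span kept)."""
    result = []
    for x in seed:
        if x in (1, span):
            continue
        for y in range(2, span):
            if y not in seed:
                result.append(tuple(sorted((set(seed) - {x}) | {y})))
    return result


def hill_climb(weight, span, length, p, restarts=5, rng_seed=0):
    rng = random.Random(rng_seed)
    best_seed, best_prob = None, -1.0
    for _ in range(restarts):
        inner = rng.sample(range(2, span), weight - 2)   # random restart
        seed = tuple(sorted([1, span] + inner))
        prob = sensitivity(seed, length, p)
        while True:
            scored = [(sensitivity(n, length, p), n) for n in neighbours(seed, span)]
            if not scored:
                break
            top_prob, top_seed = max(scored)
            if top_prob <= prob:
                break                                    # local optimum reached
            seed, prob = top_seed, top_prob
        if prob > best_prob:
            best_seed, best_prob = seed, prob
    return best_seed, best_prob


print(hill_climb(weight=4, span=7, length=20, p=0.7))
```

The evaluator tracks the last s-1 bits directly, so its cost grows as 2^(s-1); that is workable for small illustrative seeds only.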

Detection Probabilities • The overlap structure of a seed is encoded in a DFA • DP computes the probability that the DFA accepts a random similarity of length l from the kth-order Markov model M • $P(q, t, \delta \cdot b)$ is the probability of reaching state q after reading t bits of an input S, the last k+1 of which are $\delta \cdot b$ • For a state q, let $\Phi_b(q)$ be the set of all states that transition to q on bit b • Recurrence: $P(q, t, \delta \cdot b) = \Pr(S[t] = b \mid S[t-k \ldots t-1] = \delta) \times \sum_{q' \in \Phi_b(q)} \sum_{b_0 \in \{0,1\}} P(q', t-1, b_0 \cdot \delta)$
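The same kind of dynamic program can be sketched in Python under simplifying assumptions. Instead of DFA states, this version uses the last max(s-1, k) bits as its state, which is simpler but larger than a DFA. The Markov model is supplied as a hypothetical table `prob_one`, indexed by the previous k bits, and the history before the first bit is assumed to be all zeros. None of these names come from the presentation.

```python
from collections import defaultdict


def detection_probability(seed, length, order, prob_one):
    """Pr[seed detects S] for S of the given length drawn from a Markov model.

    seed     -- 1-based seed positions, e.g. (1, 3, 4, 6, 7)
    order    -- k, the order of the Markov model
    prob_one -- list of 2**k values; prob_one[h] = Pr[next bit = 1 | last k bits = h],
                where h packs the history with the newest bit in the lowest bit
    """
    span = max(seed)
    mask = sum(1 << (span - x) for x in seed)
    memory = max(span - 1, order)          # bits needed for the seed and the model
    keep = (1 << memory) - 1
    history = (1 << order) - 1
    state = {0: 1.0}                       # window -> Pr[window and no hit so far]
    for t in range(1, length + 1):
        nxt = defaultdict(float)
        for window, prob in state.items():
            p1 = prob_one[window & history]
            for bit, pb in ((1, p1), (0, 1.0 - p1)):
                full = (window << 1) | bit
                if t >= span and full & mask == mask:
                    continue               # accepted: remove from the no-hit mass
                nxt[full & keep] += prob * pb
        state = nxt
    return 1.0 - sum(state.values())


# Order 0 (independent bits) and an order-1 model with made-up probabilities.
print(detection_probability((1, 3, 4, 6, 7), 64, 0, [0.7]))
print(detection_probability((1, 3, 4, 6, 7), 64, 1, [0.5, 0.8]))
```

The probabilities in the order-1 table are made up for illustration; a real model would be estimated from training alignments.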

Performance Comparison – Non Coding DNA Sequence

Performance Comparison – Coding DNA Sequence

Influence of Model Order • M5 model (solid line): exploits nearest-neighbor correlations • Mc5 model (dashed line): exploits correlations arising from codon structure

Multi-Seed Design – Why? • Seed-matching heuristics optimize a tradeoff between sensitivity (true +ve rate) and specificity (1 – false +ve rate) • True +ve: an alignment contains a seed match • False +ve: a seed match occurs by chance (probability ~ $1/4^w$ at a given pair of positions) • Increasing w reduces π’s false +ve rate but lowers its sensitivity

Multi-Seed Design – Why? • Multiple seeds provide a more attractive way to trade sensitivity for specificity • Use a set Π of seeds with weight w’ > w • Expected rate of chance matches: $|\Pi| / 4^{w'}$
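A small calculation makes the trade concrete. The sketch below evaluates the chance-match rate for a single seed and for seed sets of higher weight, using exact fractions. The function name `chance_match_rate` is hypothetical, and the specific weights are chosen only for illustration.

```python
from fractions import Fraction


def chance_match_rate(num_seeds, weight):
    """Expected chance-match rate |Pi| / 4**w' for a set of seeds of equal weight."""
    return Fraction(num_seeds, 4 ** weight)


one_seed = chance_match_rate(1, 11)     # single seed of weight 11
four_seeds = chance_match_rate(4, 12)   # four seeds of weight 12
two_seeds = chance_match_rate(2, 12)    # two seeds of weight 12

print(one_seed == four_seeds)           # same false-positive budget
print(two_seeds / one_seed)             # half the chance matches
```

Raising the weight by one buys room for four times as many seeds at the same expected number of chance matches, so a well-chosen set can spend that budget on extra sensitivity.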

Problem Definition • A seed π matches alignment α → event $E_\pi(\alpha)$ • Mismatch → complementary event $\overline{E_\pi}(\alpha)$ • Match probability of π in M is given by $\Pr_{\alpha \sim M}(E_\pi(\alpha))$ • A set Π matches α if at least one of its seeds matches (event $E_\Pi(\alpha)$)

Problem Definition • Find a set Π of n seeds that maximizes sensitivity to S from model M, i.e. maximize the detection probability $\Pr_{S \sim M}[\Pi \text{ detects } S]$

Algorithms for Multi-Seed Design • Local approach (used in Mandala) • Greedy covering • Beam search

Mandala’s Local Search Algorithm • Given w and s • Begin with a set Π0 of n randomly chosen seeds with common w and s • Choose i and j, where 1 ≤ i ≤ n and 2 ≤ j ≤ w, then find the best seed set Π1 in the neighbourhood of Π0 by deleting position xj of the ith seed πi ∈ Π0 and replacing it with a position between 1 and s-1 not currently inspected by πi • Iterate through i and j until no further improvement is possible

Greedy Heuristic for Computing Seed Sets • Given a partial seed set Π, choose the next seed π that maximizes the conditional match probability under alignment model M: $\Pr(E_\pi \mid \overline{E_\Pi})$ • i.e. the seed most likely to match an alignment not already matched by some seed in the current set • Start from a single locally optimal seed
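The greedy step can be sketched as follows. This is an illustration under stated assumptions, not the presentation's implementation: bits are i.i.d. matches with probability p rather than drawn from a Markov model, candidates are enumerated exhaustively instead of being found by local search, and the first seed is simply the best single candidate. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def set_sensitivity(seeds, length, p):
    """Pr[at least one seed in `seeds` detects a random similarity] (i.i.d. bits)."""
    if not seeds:
        return 0.0
    span = max(max(s) for s in seeds)
    # One (mask, span) pair per seed; the newest bit is the lowest bit.
    masks = [(sum(1 << (max(s) - x) for x in s), max(s)) for s in seeds]
    keep = (1 << (span - 1)) - 1
    state = {0: 1.0}                      # window -> Pr[no seed has hit so far]
    for t in range(1, length + 1):
        nxt = defaultdict(float)
        for window, prob in state.items():
            for bit, pb in ((1, p), (0, 1.0 - p)):
                full = (window << 1) | bit
                if any(t >= sp and full & m == m for m, sp in masks):
                    continue
                nxt[full & keep] += prob * pb
        state = nxt
    return 1.0 - sum(state.values())


def conditional_gain(seed, chosen, covered, length, p):
    """Pr(E_seed | no seed in `chosen` matches), via the joint sensitivity."""
    joint = set_sensitivity(chosen + [seed], length, p)
    return (joint - covered) / (1.0 - covered)


def greedy_seed_set(candidates, n, length, p):
    chosen, covered = [], 0.0
    for _ in range(n):
        remaining = [c for c in candidates if c not in chosen]
        best = max(remaining, key=lambda c: conditional_gain(c, chosen, covered, length, p))
        chosen.append(best)
        covered = set_sensitivity(chosen, length, p)
    return chosen, covered


# Candidate seeds: weight 4, first position 1, span at most 7.
candidates = [(1,) + rest for rest in combinations(range(2, 8), 3)]
print(greedy_seed_set(candidates, n=2, length=20, p=0.7))
```

Because each round conditions on the alignments the current set misses, the second seed tends to complement the first rather than duplicate it.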

Extension to Beam Search • Initially find a number of locally optimal single seeds • The best b are saved and used in the next optimization round • For each saved seed, find N seeds, each of which locally optimizes $\Pr(E_\pi \mid \overline{E_\Pi})$ • The b seed pairs {π0, π} with highest match probability over all b·N pairs are again saved • The best seed set overall is chosen

Performance

Computing Conditional Match Probabilities • Construct a DFA Aπ that accepts alignments containing a seed match to π • By DP, compute the probability that Aπ accepts a random alignment of length l from M • Compute the conditional probability for seed π and set Π: $\Pr(E_\pi \mid \overline{E_\Pi}) = \dfrac{\Pr(E_{\Pi \cup \pi}) - \Pr(E_\Pi)}{1 - \Pr(E_\Pi)}$
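The conditional match probability formula follows from the definition of conditional probability, as the short derivation below shows. It assumes only that $E_{\Pi \cup \pi}$ denotes the event that π or some seed in Π matches, i.e. $E_\pi \cup E_\Pi$.

```latex
\begin{align*}
\Pr(E_\pi \mid \overline{E_\Pi})
  &= \frac{\Pr(E_\pi \cap \overline{E_\Pi})}{\Pr(\overline{E_\Pi})} \\
  &= \frac{\Pr(E_\pi \cup E_\Pi) - \Pr(E_\Pi)}{1 - \Pr(E_\Pi)} \\
  &= \frac{\Pr(E_{\Pi \cup \pi}) - \Pr(E_\Pi)}{1 - \Pr(E_\Pi)}
\end{align*}
```

The second line uses $\Pr(E_\pi \cup E_\Pi) = \Pr(E_\Pi) + \Pr(E_\pi \cap \overline{E_\Pi})$, so only unconditional match probabilities of seed sets are needed, and each of those can be computed by the DFA-based DP.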

Detection Probabilities • Let π be a seed of weight w and span s • Qπ: the set of all s-bit strings matching π • Construct a trie Tπ from the strings of Qπ • Convert Tπ to a DFA Aπ (Aho-Corasick algorithm) • Aπ accepts a similarity S iff π detects S
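The construction on this slide can be sketched in Python: enumerate Qπ, insert its strings into a trie, then complete the trie into a DFA with Aho-Corasick failure links. This is a minimal sketch with hypothetical names (`seed_dfa`, `accepts`). Accepting states are made absorbing, so the DFA accepts S as soon as any window of S matches π.

```python
from collections import deque
from itertools import product


def seed_dfa(seed):
    """Build (transitions, accepting) for a seed given as 1-based positions."""
    span = max(seed)
    free = [x for x in range(1, span + 1) if x not in seed]
    # Q_pi: every span-bit string with a 1 at each seed position.
    patterns = []
    for bits in product("01", repeat=len(free)):
        chars = ["1"] * span
        for pos, b in zip(free, bits):
            chars[pos - 1] = b
        patterns.append("".join(chars))
    # Trie T_pi.
    goto, accept = [{}], [False]
    for pat in patterns:
        q = 0
        for ch in pat:
            if ch not in goto[q]:
                goto.append({})
                accept.append(False)
                goto[q][ch] = len(goto) - 1
            q = goto[q][ch]
        accept[q] = True
    # Aho-Corasick: breadth-first failure links, filling in missing transitions.
    fail = [0] * len(goto)
    queue = deque()
    for ch in "01":
        if ch in goto[0]:
            queue.append(goto[0][ch])
        else:
            goto[0][ch] = 0
    while queue:
        q = queue.popleft()
        for ch in "01":
            if ch in goto[q]:
                child = goto[q][ch]
                fail[child] = goto[fail[q]][ch]
                accept[child] = accept[child] or accept[fail[child]]
                queue.append(child)
            else:
                goto[q][ch] = goto[fail[q]][ch]
    # Once a seed match has been seen, stay in the accepting state.
    for q in range(len(goto)):
        if accept[q]:
            goto[q] = {"0": q, "1": q}
    return goto, accept


def accepts(dfa, similarity):
    goto, accept = dfa
    q = 0
    for ch in similarity:
        q = goto[q][ch]
    return accept[q]


dfa = seed_dfa((1, 3, 4, 6, 7))
print(accepts(dfa, "1011011"), accepts(dfa, "1001011"))
```

Qπ has 2^(s-w) strings, so the trie stays small when the seed has few "don't care" positions. The detection-probability DP then runs over these DFA states rather than over all possible windows.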
