
Index-based search of single sequences



Presentation Transcript


  1. Index-based search of single sequences Omkar Mate CS 374 Stanford University

  2. Motivation A newly discovered gene … (may hold clues to evolution of human brain capacity ) • Occurrence in other species • Mutation • …

  3. Sequence Alignment New query: gacatta Existing genome database: … gattaccagattaccagattaccagattacca caggattacaggattacaggattacaggatta cattaggacattaggacattaggacattagga aaccattagaccattagaccattagaccatta … Easy? Think again …

  4. State of Biological Databases ~300 sequenced genomes

  5. Alignment Problem • Assume we try Smith-Waterman: [running time = O(MN)] The entire genomic database: 10^11 Our new gene: 10^4 Time and space complexity: O(10^15) Huge number! But we can do better …
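To make the O(MN) cost concrete, here is a minimal sketch of Smith-Waterman scoring. The scoring values (match = 1, mismatch = -1, gap = -1) are assumptions for illustration and do not come from the slides. The sketch keeps only two rows of the table, so it returns the best local score without a traceback; the loop still visits every one of the M × N cells.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Number of table cells for the slide's sizes: a 10^4 gene vs. a 10^11 database.
cells = 10**4 * 10**11
```

With the sizes on the slide, `cells` is 10^15, which is why a full dynamic-programming scan of the database is impractical.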

  6. Indexing-based Local Alignment BLAST: Basic Local Alignment Search Tool Main idea: • Construct a dictionary of all the words in the query • Initiate a local alignment for each word match (hit) between query and DB Running time: O(MN) in the worst case. However, orders of magnitude faster than Smith-Waterman in practice!

  7. BLAST • Step 1: Construct dictionary of query words (query indexed by all words of size k = 4) Query: AACGTTGATCAGCTAGACTGACTAGCATCAGCATCAGCATCAGCATC… Each window of k consecutive letters goes into the index of query words
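A minimal sketch of Step 1, assuming the index is a plain dictionary from each k-letter word to the list of query positions where it starts. The function name `build_word_index` is hypothetical, and the query string is the visible prefix from the slide.

```python
from collections import defaultdict


def build_word_index(query, k=4):
    """Map every k-letter word in the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


# Visible prefix of the slide's query (the slide elides the rest).
index = build_word_index("AACGTTGATCAGCTAGACTGACTAGCATCAGCATCAGCATCAGCATC")
```

Looking up a word such as `index["AACG"]` returns the positions where that word begins in the query.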

  8. BLAST • Step 2: Generate all the relatives of a word (relative: a word with alignment score greater than a threshold T) Example: query word ATGC, T = 50. Candidates whose score exceeds T are relatives. Update the index … (Query: AATGCCGATAGCATCG …)
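A rough sketch of Step 2. The slides give the word ATGC and T = 50 but not the scoring scheme, so the values here (match = 20, mismatch = -5) are assumptions chosen only so that the example is non-trivial. The sketch enumerates every candidate word of the same length and keeps those scoring strictly above T.

```python
from itertools import product


def relatives(word, threshold, match=20, mismatch=-5, alphabet="ACGT"):
    """Return all words whose ungapped score against `word` exceeds threshold.

    The match/mismatch scores are hypothetical; the slides do not state them.
    """
    result = []
    for candidate in product(alphabet, repeat=len(word)):
        score = sum(match if c == w else mismatch
                    for c, w in zip(candidate, word))
        if score > threshold:
            result.append("".join(candidate))
    return result


atgc_relatives = relatives("ATGC", 50)
```

Under these assumed scores, the relatives of ATGC are the word itself plus every word that differs from it in exactly one position. Brute-force enumeration is fine for k = 4 but grows as 4^k.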

  9. BLAST • Step 3: Searching • Search through the database linearly, one word at a time • Initiate an alignment with all occurrences of that word in the query (looked up in the index) Database: ATCGCTATCGCTACGACTACGACTACGATCAGCATCTC …
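The search step can be sketched as a single linear pass over the database, looking each k-letter word up in the query index. This sketch uses exact words only (no relatives), and `find_hits` is a hypothetical name.

```python
from collections import defaultdict


def find_hits(query, database, k=4):
    """Return (query_pos, db_pos) pairs where a k-letter word occurs in both."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)

    hits = []
    for j in range(len(database) - k + 1):
        # .get avoids creating empty entries for words absent from the query.
        for i in index.get(database[j:j + k], ()):
            hits.append((i, j))
    return hits
```

Each returned pair is a candidate starting point for the extension step.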

  10. BLAST • Step 4: Alignment Extension Once we find a hit, extend to the left and right with no gaps until the alignment score falls below a certain threshold. ATGCCGATACGATCAGCTACGATCAG…
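A sketch of ungapped extension under one possible reading of the stopping rule: extension in each direction stops once the running score has dropped more than `drop` below the best score seen so far (a drop-off rule). The scores and the `drop` value are assumptions, and `extend_hit` is a hypothetical name.

```python
def extend_hit(query, db, qi, dj, k, match=1, mismatch=-1, drop=2):
    """Extend an exact k-letter hit at query[qi:] / db[dj:] without gaps.

    Returns (query_start, db_start, length, score) of the best extension.
    Scoring and the drop-off value are illustrative assumptions.
    """
    def extend(step, q, d):
        best = score = 0
        best_len = length = 0
        while 0 <= q < len(query) and 0 <= d < len(db):
            score += match if query[q] == db[d] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > drop:
                break  # score fell too far below the best seen
            q += step
            d += step
        return best, best_len

    right_score, right_len = extend(+1, qi + k, dj + k)
    left_score, left_len = extend(-1, qi - 1, dj - 1)
    return (qi - left_len, dj - left_len,
            left_len + k + right_len,
            k * match + left_score + right_score)
```

Only the best-scoring prefix of each extension is kept, so trailing mismatches are trimmed from the reported alignment.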

  11. Sensitivity-Speed Tradeoff Longer words => Fewer alignments => Faster but Low chance of a match Shorter words => More alignments => High chance of a match but Slower

  12. BLAT - The BLAST-Like Alignment Tool Similarities: • Rapid scans for relatively short matches (hits) • Extend these hits into high-scoring pairs (HSP) Differences (BLAST vs. BLAT): • BLAST indexes the query sequence; BLAT indexes the database • BLAST scans through the database; BLAT scans through the query sequence • BLAST does not allow gaps in extension; BLAT allows gaps in extension • BLAST returns smaller alignments; BLAT returns larger alignments

  13. BLAT Strategies - I • Single perfect matches We do not allow any mismatch. Common intuition : fewer matches for longer words

  14. Sensitivity and Specificity: Single Perfect Nucleotide K-mer Matches as a Search Criterion

  15. BLAT Strategies - II • Allow one mismatch Intuition: Higher number of matches for same word length => Better sensitivity (Caution: Keep k higher, else no. of matches will be huge)

  16. Sensitivity and Specificity: Single Near-Perfect Nucleotide K-mer Matches as a Search Criterion (one mismatch allowed)

  17. BLAT Strategies - III • Allow multiple perfect matches Two parameters: N: no. of matches, K: word size Practically: Same sensitivity, higher speed

  18. Sensitivity and Specificity: Multiple Perfect Nucleotide K-mer Matches as a Search Criterion (2 and 3 perfect matches)

  19. Seeded Alignment: A dominant paradigm for fast comparisons • Seed: A common pattern of positions used for efficient large scale comparison of genomic DNA …G A T T A C C A G A T T A C C A G A T T A … Seed: {0,1,2,3} => Comparison sequences: {G A T T} {A T T A} {T T A C} … Seed: {0,2,4,6} => Comparison sequences: {G T A C} {A T C A} {T A C G} …
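The comparison sequences on this slide can be reproduced with a short sketch. It assumes a seed is given as a list of positions starting at 0, and `seed_words` is a hypothetical helper name.

```python
def seed_words(sequence, seed):
    """Read the letters at the seed's positions at every offset of sequence.

    Assumes the seed's smallest position is 0, as in the slide's examples.
    """
    span = max(seed) + 1
    return ["".join(sequence[offset + p] for p in seed)
            for offset in range(len(sequence) - span + 1)]


dna = "GATTACCAGATTACCAGATTA"
contiguous = seed_words(dna, [0, 1, 2, 3])
spaced = seed_words(dna, [0, 2, 4, 6])
```

The first three entries of `contiguous` are GATT, ATTA, TTAC and the first three of `spaced` are GTAC, ATCA, TACG, matching the slide.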

  20. Similarity Detection Seed: {0,2,4,5} Sequence 1: A T C G A C T Sequence 2: C T A G T C T Offset = 0 => Mismatch Offset = 1 => Match We can have multiple seeds (patterns)!
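A small sketch of the check on this slide: at each offset, the two sequences are compared only at the seed's positions. The function name `matching_offsets` is hypothetical.

```python
def matching_offsets(seq1, seq2, seed):
    """Return the offsets at which seq1 and seq2 agree on every seed position."""
    span = max(seed) + 1
    n = min(len(seq1), len(seq2))
    return [offset for offset in range(n - span + 1)
            if all(seq1[offset + p] == seq2[offset + p] for p in seed)]


offsets = matching_offsets("ATCGACT", "CTAGTCT", [0, 2, 4, 5])
```

For the slide's two sequences the seed fails at offset 0 (A vs. C at the first position) and matches at offset 1, so `offsets` is `[1]`.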

  21. Seed Design A (hazy) problem definition Input: a collection of ungapped genomic sequence similarities Parameters: length of seeds, resource limits, … => Algorithm => Output: a set of seeds that will give “optimum performance” (What are the parameters? How do you define optimum performance?)

  22. Tasks in Seed Design • Define a measure of goodness for a seed. Easy! Sensitivity to interesting biosequence similarities. • Show how to evaluate goodness for a seed. Hard! No efficient algorithm.

  23. Terminology related to a Seed A seed P = an ordered list of w positions, i.e. P = {x1, x2, …, xw} w = weight of P = |P| s = span of P = xw – x1 + 1 Ex: P = {0, 1, 4, 5} w = 4 s = 5 – 0 + 1 = 6

  24. Computational Cost Computational cost = f(seed weight w, no. of seeds n)

  25. Performance Measurement Optimum performance => Maximum sensitivity (i.e. detection probability) to the similarities S (Currently, Markov Models are used to measure these probabilities!)

  26. Markov Models • A kth order Markov model M: Given the previous k bits, predict the (k+1)th bit Example: 4th order Markov model

  27. (Exact) Problem Definition Inputs: • Number of seeds: n • Weight of each seed: w • Markov Model: M • Similarities: S Output: A set of seeds (ordered positions), P = {x11, …, x1w}, {x21, …, x2w}, …, {xn1, …, xnw} that maximizes detection probability for S

  28. Computing Detection Probabilities • Challenge: The probability of at least one match varies because the probabilities of matches at different offsets are not independent. Ex: Seed = {0,2,3,5} This similarity has 2 matches, at offsets 0 and 2, which share two of four positions in common
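The dependence between offsets can be seen numerically with a brute-force sketch. It assumes the simplest possible model, in which each position of the similarity is a match (bit 1) independently with probability p; this is an order-0 stand-in for the Markov model in the slides, and `detection_probability` is a hypothetical name. Enumeration costs 2^length, so this is only for tiny examples.

```python
from itertools import product


def detection_probability(seed, length, p):
    """Probability that the seed matches at one or more offsets.

    Assumes each of the `length` positions is a match independently with
    probability p (an order-0 simplification of the slides' Markov model).
    """
    span = max(seed) + 1
    total = 0.0
    for bits in product((0, 1), repeat=length):
        hit = any(all(bits[offset + x] for x in seed)
                  for offset in range(length - span + 1))
        if hit:
            ones = sum(bits)
            total += p ** ones * (1 - p) ** (length - ones)
    return total


exact = detection_probability([0, 2, 3, 5], 8, 0.5)
# What one would get by (wrongly) treating the three offsets as independent.
naive = 1 - (1 - 0.5 ** 4) ** 3
```

For the seed {0,2,3,5} on a similarity of length 8, `exact` comes out lower than `naive`, because matches at overlapping offsets share positions and are positively correlated.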

  29. DFA (Deterministic Finite Automaton) to compute Detection Probability • Construct a DFA that accepts a string of 1’s and 0’s defined by the seed P. • Ex: P = {0,2} i.e. for a substring of length 3, we need a match in 1st and 3rd position. • Then the DFA should accept strings given by the regular expression: “(0+1)*1(0+1)1(0+1)*”
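The language the DFA accepts can be illustrated with Python's `re` module, writing the slide's `(0+1)` as the character class `[01]`. Note that `re` is a backtracking matcher rather than an explicit DFA, so this sketch only shows which strings are accepted, not how the automaton is built. The helper names are hypothetical.

```python
import re


def seed_regex(seed):
    """Build a regex accepting bit strings in which the seed matches somewhere."""
    span = max(seed) + 1
    core = "".join("1" if i in seed else "[01]" for i in range(span))
    return re.compile("[01]*" + core + "[01]*")


def detects(seed, bits):
    """True if the whole bit string is accepted for this seed."""
    return seed_regex(seed).fullmatch(bits) is not None
```

For P = {0,2} the generated pattern is `[01]*1[01]1[01]*`, the same expression as on the slide in Python syntax.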

  30. Dynamic Programming Algorithm To compute the detection probabilities recursively! Complexity analysis: • Size of DFA <= • Time to construct a DFA = • Time for each step = • No. of steps = l • Total time complexity = This is faster by a factor of s^2/w than the best previous algorithm for detection probabilities
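The recursive idea can be sketched in a simplified form. This is not the algorithm from the slides: it assumes an order-0 model (each position is a match independently with probability p) and uses the last s-1 bits read as the state instead of a compact DFA, so its state space is larger than necessary. It does show the same pattern of pushing probability mass forward one bit at a time. The function name is hypothetical.

```python
def detection_probability_dp(seed, length, p):
    """Detection probability by dynamic programming over bit-string suffixes.

    Simplified sketch: order-0 model, state = last (span - 1) bits read.
    """
    span = max(seed) + 1
    states = {(): 1.0}   # suffix of undetected strings -> probability mass
    detected = 0.0       # mass that has already hit the seed (absorbing)
    for _ in range(length):
        nxt = {}
        for suffix, prob in states.items():
            for bit, weight in ((1, p), (0, 1 - p)):
                window = suffix + (bit,)
                if len(window) == span and all(window[x] for x in seed):
                    detected += prob * weight
                else:
                    key = window[-(span - 1):] if span > 1 else ()
                    nxt[key] = nxt.get(key, 0.0) + prob * weight
        states = nxt
    return detected


dp_value = detection_probability_dp([0, 2, 3, 5], 8, 0.5)
```

Each of the l steps touches every live state twice, once per bit, which mirrors the "time per step × number of steps" structure of the complexity analysis.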

  31. Remarks about the algorithm • Can be extended to work with a set of seeds • The DFA need not be minimal • Time complexity can be further reduced

  32. Structure in Seed Space • Addressing the problem: When is one seed more sensitive than another? • Factors: • Parameters of the Markov model M • Similarity length : l • Smaller length: irregular behavior => We can generalize only for asymptotic cases

  33. Asymptotic Result Let E_l(P) = event that P detects S (of length l) at some offset, and E_l^c(P) = the complementary event. Then a seed P is asymptotically worse than a seed P’, written P < P’, if lim_{l→∞} Pr[E_l^c(P)] / Pr[E_l^c(P’)] > 1 (P’ has a better chance of detecting S than P does!)

  34. Mandala: Fast, Practical Seed Design • Seed selection: • No efficient algorithm to find optimum w, s given M (except Brute Force) • Applies local search method; global efficiency sacrificed • Training a Markov model: • Training set is adaptively selected to suit the intended application • Samples training set using LSH-ALL-PAIRS algorithm

  35. Experimental Results - I • Avg. detection probabilities given by theoretical models for random seeds (w = 11) Solid line: M0 Dashed line: M5

  36. Experimental Results - II • Detection probabilities for best seeds found by Mandala (k = 5) Solid: noncoding DNA model Dashed: coding DNA model

  37. Directions for further research • Extend the model to evaluate seeds • Extend similarity models to distinguish between different classes of substitution • Construct models of multiple alignment: to compare 3 or more genomes at once

  38. Timing - BLAT vs WU-TBLASTX • Dataset: 1000 Mouse Reads and a RepeatMasked Human Chromosome 22

  39. Sensitivity – BLAT vs WU-TBLASTX • Dataset: 13 million Mouse Shotgun Reads and Human Chromosome 22

  40. Sensitivity and Specificity: Single Perfect Amino Acid K-mer Matches as a Search Criterion

  41. Sensitivity and Specificity: Single Near-Perfect Amino Acid K-mer Matches as a Search Criterion (one mismatch allowed)

  42. Sensitivity and Specificity: Multiple Perfect Amino Acid K-mer Matches as a Search Criterion (2 and 3 perfect matches)

  43. Mathematical Formula • To compute the probability that the DFA associated with a seed accepts a string randomly chosen from a Markov Model M S = similarity, l = length of S, k = order of M, delta = bit string of length k, q = a state, Phi(q) = set of states that transition to q on bit b t = no. of bits read, k’ = min {k, t}

  44. Dynamic Programming • Initialize the recurrence: P (q0, 0, 0) = 1 • After l steps, return the sum over all k+1-mer bit strings delta.1 of P (qa, l , delta.1)
