
Design of Optimal Multiple Spaced Seeds for Homology Search

Design of Optimal Multiple Spaced Seeds for Homology Search. Jinbo Xu, School of Computer Science, University of Waterloo. Joint work with D. Brown, M. Li and B. Ma. Overview: seed-based homology search; optimal multiple spaced seeds; LP-based randomized algorithm; experimental results.


Presentation Transcript


  1. Design of Optimal Multiple Spaced Seeds for Homology Search. Jinbo Xu, School of Computer Science, University of Waterloo. Joint work with D. Brown, M. Li and B. Ma.

  2. Overview • Seed-based homology search • Optimal multiple spaced seeds • LP based randomized algorithm • Experimental results • Future work

  3. Homology Search
     Given: a database of DNA sequences and a query sequence Q.
     Task: extract all sequences homologous to Q from the database.
     • Exhaustive search algorithm, e.g. the Smith-Waterman algorithm: 100% sensitivity, but infeasible if the database is large
     • Suffix tree
     • Seed-based algorithm, e.g. BLAST, PatternHunter

  4. Region and Seed
     S1: AGCTTGCCGTAAACCG
     S2: ACGTAGCACTGAGCTG
     Region model: 1001011001010101 (1: the two sequences match at that position; 0: they differ)
     Seed: 1001001011 (1: a required match; 0: "don't care")
     Seed length M: the length of the seed string
     Seed weight W: the number of 1s in the seed
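
The definitions on this slide can be checked with a few lines of Python. This is a minimal sketch, assuming the two sequences are already aligned without gaps; the function names `region_model`, `seed_length` and `seed_weight` are illustrative and do not come from the slides.

```python
def region_model(s1: str, s2: str) -> str:
    """Return the 0/1 string of a gap-free alignment: 1 = match, 0 = mismatch."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")
    return "".join("1" if a == b else "0" for a, b in zip(s1, s2))


def seed_length(seed: str) -> int:
    """Seed length M: the length of the seed string."""
    return len(seed)


def seed_weight(seed: str) -> int:
    """Seed weight W: the number of required-match positions (1s)."""
    return seed.count("1")


# The two sequences and the seed shown on the slide.
print(region_model("AGCTTGCCGTAAACCG", "ACGTAGCACTGAGCTG"))  # 1001011001010101
print(seed_length("1001001011"), seed_weight("1001001011"))  # 10 5
```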

  5. Seed-based Hit
     A seed s hits a region R at position i if and only if R[i+j] = 1 for every position j where s[j] = 1.
     Query:    ACGCGTGGGAAACC
     Sequence: CAATGTGGGCAATT
     Region:   00001111101100
     Seed:         11011011
     Given a seed, a query sequence hits another sequence if and only if the seed hits the region model of the two sequences.
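
The hit condition translates directly into code. The sketch below assumes 0-based positions and a region given as a 0/1 string; `hit_positions` is a hypothetical helper name, not something defined in the talk.

```python
from typing import List


def hit_positions(seed: str, region: str) -> List[int]:
    """Return every 0-based offset i at which the seed hits the region.

    The seed hits at i when region[i + j] == '1' for every j with seed[j] == '1'.
    """
    required = [j for j, c in enumerate(seed) if c == "1"]
    last_offset = len(region) - len(seed)
    return [
        i
        for i in range(last_offset + 1)
        if all(region[i + j] == "1" for j in required)
    ]


# Region and seed from the slide: the seed fits only at offset 4.
print(hit_positions("11011011", "00001111101100"))  # [4]
```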

  6. Single Seed Based Algorithm
     Seed:  101110111011011101 (length = 18, weight = 13)
     Query: GGAAGCTTGCCGTATGCCATAG
     S1:    CCAGGCTAGCCATAGGCCTTCT
     Query: GGAAGCTTGCCGTATGCCATAG
     S2:    CCAGGCATGCAGTAGGCCTTCT
     S1 is hit, but S2 is missed.

  7. Multiple Seeds Based Algorithm
     Seed 1: 101110111011011101
     Seed 2: 101101110111011101
     (each seed has length 18 and weight 13)
     Query: GGAAGCTTGCCGTATGCCATAG
     S1:    CCAGGCTAGCCATAGGCCTTCT
     Query: GGAAGCTTGCCGTATGCCATAG
     S2:    CCAGGCATGCAGTAGGCCTTCT
     Both S1 and S2 are hit.
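
The single-seed and multiple-seed examples can be reproduced as follows. This is a sketch under the assumption that the sequences are compared position by position without gaps; the helper names are illustrative.

```python
from typing import Iterable


def region_model(s1: str, s2: str) -> str:
    """0/1 string of a gap-free alignment: 1 = match, 0 = mismatch."""
    return "".join("1" if a == b else "0" for a, b in zip(s1, s2))


def hits(seed: str, region: str) -> bool:
    """True if the seed hits the region at some offset."""
    required = [j for j, c in enumerate(seed) if c == "1"]
    return any(
        all(region[i + j] == "1" for j in required)
        for i in range(len(region) - len(seed) + 1)
    )


def any_seed_hits(seeds: Iterable[str], region: str) -> bool:
    """A seed set hits a region if at least one of its seeds does."""
    return any(hits(seed, region) for seed in seeds)


query = "GGAAGCTTGCCGTATGCCATAG"
s1 = "CCAGGCTAGCCATAGGCCTTCT"
s2 = "CCAGGCATGCAGTAGGCCTTCT"
seed1 = "101110111011011101"
seed2 = "101101110111011101"

print(hits(seed1, region_model(query, s1)))                    # True
print(hits(seed1, region_model(query, s2)))                    # False
print(any_seed_hits([seed1, seed2], region_model(query, s2)))  # True
```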

  8. Optimal Multiple Seeds (OMS) Problem
     Given: a random region R drawn from a given distribution, two integers M and W, and an integer k.
     Find: a set of k seeds, each with weight W and length at most M, that maximizes the probability that at least one seed hits R.

  9. Related Work
     • Mandala (J. Buhler et al.)
       • Hill climbing: good for small k, no result reported for k > 4
       • Greedy + Monte Carlo sampling
     • Greedy algorithm (M. Li and B. Ma et al.)
       • Given i seeds (i = 1, 2, …, k-1), search for the next seed by maximizing the incremental sensitivity
     • Vector seeds (B. Brejova et al.)
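
The greedy idea of adding, at each step, the seed with the largest incremental sensitivity can be sketched as below. This is an illustration rather than the cited authors' implementation: it assumes a fixed list of candidate seeds and approximates sensitivity by the number of hit regions in a given sample. All names are hypothetical.

```python
from typing import List, Sequence, Set


def hits(seed: str, region: str) -> bool:
    """True if the seed hits the 0/1 region at some offset."""
    required = [j for j, c in enumerate(seed) if c == "1"]
    return any(
        all(region[i + j] == "1" for j in required)
        for i in range(len(region) - len(seed) + 1)
    )


def greedy_seed_set(candidates: Sequence[str], regions: Sequence[str], k: int) -> List[str]:
    """Pick up to k seeds, each time taking the one that hits the most not-yet-hit regions."""
    chosen: List[str] = []
    uncovered: Set[int] = set(range(len(regions)))
    for _ in range(k):
        best_seed, best_gain = None, set()
        for seed in candidates:
            if seed in chosen:
                continue
            gain = {r for r in uncovered if hits(seed, regions[r])}
            if len(gain) > len(best_gain):  # ties keep the earlier candidate
                best_seed, best_gain = seed, gain
        if best_seed is None:  # no remaining seed adds any new region
            break
        chosen.append(best_seed)
        uncovered -= best_gain
    return chosen


# Toy input: three candidate seeds and four sampled regions.
print(greedy_seed_set(["111", "1101", "1011"], ["1111", "1101", "1101", "1011"], 2))
# ['1101', '1011']
```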

  10. Variants of OMS
      • Seed Specific OMS problem: given a random region R, a set of m candidate seeds, and an integer k, find k of the m seeds that maximize the hit probability of R.
      • Seed-Region Specific OMS problem: given a set of m candidate seeds, an integer k, and a set of regions, find k of the m seeds that maximize the number of regions hit.

  11. Maximum Coverage (MC) Problem
      Given: a ground set H, a collection of subsets of H, and an integer k.
      Find: k of the subsets that together cover as much of H as possible.
      Main results:
      • Approximation ratio 1 - 1/e by a greedy algorithm (D.S. Hochbaum)
      • Same approximation ratio by a linear programming based randomized algorithm
      • The ratio 1 - 1/e is tight unless P=NP (U. Feige)
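
For reference, the standard LP relaxation of maximum coverage is written out below. The slides do not show their exact formulation, so the notation is an assumption introduced here: $H_1, \dots, H_m$ are the given subsets, $x_i$ is the (fractional) decision to pick $H_i$, and $y_j$ is the (fractional) decision that element $j$ is covered.

```latex
\begin{align*}
\text{maximize}\quad   & \sum_{j \in H} y_j \\
\text{subject to}\quad & \sum_{i=1}^{m} x_i = k, \\
                       & y_j \le \sum_{i \,:\, j \in H_i} x_i && \text{for all } j \in H, \\
                       & 0 \le x_i \le 1, \quad 0 \le y_j \le 1 && \text{for all } i, j.
\end{align*}
```

An integral solution picks exactly k subsets and counts an element only if some picked subset contains it, so the LP optimum is an upper bound on the best achievable coverage.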

  12. OMS vs. MC Problem
      OMS → (seed enumeration) → Seed Specific OMS → (region sampling) → Seed-Region Specific OMS = MC Problem
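
The seed enumeration step can be sketched as follows. The sketch assumes that a seed begins and ends with a 1, a common convention that the slides do not state explicitly; `enumerate_seeds` is a hypothetical name.

```python
from itertools import combinations
from typing import Iterator


def enumerate_seeds(max_length: int, weight: int) -> Iterator[str]:
    """Yield every seed of the given weight whose length is at most max_length.

    Assumption: the first and last positions of a seed are always 1,
    so only the interior positions are chosen freely.
    """
    if weight < 2:
        raise ValueError("weight must be at least 2 under this convention")
    for length in range(weight, max_length + 1):
        # Choose where the remaining weight - 2 ones go among the interior positions.
        for interior in combinations(range(1, length - 1), weight - 2):
            bits = ["0"] * length
            bits[0] = bits[-1] = "1"
            for pos in interior:
                bits[pos] = "1"
            yield "".join(bits)


print(list(enumerate_seeds(4, 3)))  # ['111', '1101', '1011']
```

The number of candidates grows quickly with the length bound, which is why enumeration dominates the running time.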

  13. Region Model
      • PH: length 64; each bit is independently set to 1 with probability 0.7 (B. Ma et al.)
      • M3: length 64; bit i is independently set to 1 with probability 0.8 if i%3 = 1 or 2, and 0.5 otherwise (B. Brejova et al.)
      • M8: length 63; each codon follows a certain distribution (B. Brejova et al.)
      • HMM: average length 90; adjacent codons are not independent (B. Brejova et al.)
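
The two independent-bit models (PH and M3) are simple enough to sample directly, which also gives a Monte Carlo estimate of a seed set's hit probability. This is a sketch only: the M3 sampler assumes the bit index i is 0-based, which the slide does not specify, and the weight-11, length-18 seed in the usage lines is just an example input. M8 and HMM are omitted because their codon distributions are not given here.

```python
import random
from typing import Callable, Sequence


def sample_ph(rng: random.Random, length: int = 64, p: float = 0.7) -> str:
    """PH model: every bit is 1 independently with probability p."""
    return "".join("1" if rng.random() < p else "0" for _ in range(length))


def sample_m3(rng: random.Random, length: int = 64) -> str:
    """M3 model: probability 0.8 when i % 3 is 1 or 2, else 0.5 (i assumed 0-based)."""
    return "".join(
        "1" if rng.random() < (0.8 if i % 3 in (1, 2) else 0.5) else "0"
        for i in range(length)
    )


def hits(seed: str, region: str) -> bool:
    """True if the seed hits the 0/1 region at some offset."""
    required = [j for j, c in enumerate(seed) if c == "1"]
    return any(
        all(region[i + j] == "1" for j in required)
        for i in range(len(region) - len(seed) + 1)
    )


def estimate_hit_probability(
    seeds: Sequence[str],
    sampler: Callable[[random.Random], str],
    n_samples: int,
    rng: random.Random,
) -> float:
    """Fraction of sampled regions hit by at least one seed in the set."""
    hit_count = 0
    for _ in range(n_samples):
        region = sampler(rng)
        if any(hits(seed, region) for seed in seeds):
            hit_count += 1
    return hit_count / n_samples


rng = random.Random(0)  # fixed seed so the run is repeatable
example_seed = "111010010100110111"  # weight 11, length 18
print(estimate_hit_probability([example_seed], sample_ph, 2000, rng))
print(estimate_hit_probability([example_seed], sample_m3, 2000, rng))
```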

  14. Observations • PH model: the hit probability of any seed with weight 11 and length 18 is at least 0.30 • M3 model: the hit probability of any seed with weight 11 and length 18 is at least 0.27 • HMM model: the hit probability of any seed with weight 11 and length 18 is at least 0.70

  15. Variant of MC Problem Can we have a better approximation ratio?

  16. Better Approximation Ratio If the sensitivity of every seed is bounded below by a known constant, then for the seed-region specific OMS problem the LP-based randomized algorithm is guaranteed to produce a solution whose approximation ratio is bounded below in terms of that constant and the value of the optimal linear (LP) solution.
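
One common way to turn a fractional LP solution into a seed set is randomized rounding. The sketch below is an assumption about how such a step might look, not the authors' exact scheme: it takes a fractional vector `x` (computed elsewhere by an LP solver, not shown), draws k indices with probability proportional to `x[i]`, and keeps the best of several rounds. `hit_sets[i]` is assumed to hold the indices of the sampled regions hit by seed i.

```python
import random
from typing import Sequence, Set


def round_lp_solution(x: Sequence[float], k: int, rng: random.Random) -> Set[int]:
    """Draw k indices independently, index i with probability x[i] / sum(x)."""
    if sum(x) <= 0:
        raise ValueError("the fractional solution must have positive total weight")
    picks = rng.choices(range(len(x)), weights=x, k=k)
    return set(picks)  # duplicates collapse, so fewer than k seeds may remain


def coverage(chosen: Set[int], hit_sets: Sequence[Set[int]]) -> int:
    """Number of regions hit by at least one chosen seed."""
    covered: Set[int] = set()
    for i in chosen:
        covered |= hit_sets[i]
    return len(covered)


def best_of_rounds(
    x: Sequence[float],
    k: int,
    hit_sets: Sequence[Set[int]],
    rounds: int,
    rng: random.Random,
) -> Set[int]:
    """Repeat the rounding and keep the seed set with the largest coverage."""
    best: Set[int] = set()
    best_cov = -1
    for _ in range(rounds):
        candidate = round_lp_solution(x, k, rng)
        cov = coverage(candidate, hit_sets)
        if cov > best_cov:
            best, best_cov = candidate, cov
    return best


# Toy input: three seeds, four regions, and a made-up fractional solution.
toy_hit_sets = [{0, 1}, {2}, {2, 3}]
print(best_of_rounds([1.0, 0.0, 1.0], 2, toy_hit_sets, 50, random.Random(0)))
```

Repeating the rounding costs little compared with solving the LP, so keeping the best of several rounds is a cheap way to reduce variance.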

  17. Theoretical Results

  18. Practical Approximation Ratio Compare the optimal seed set for the random region R with the best seed set found by the LP-based algorithm. If 5000 regions are sampled, a bound relating the two holds with probability 0.99.

  19. Practical Approximation Ratio (W=10)

  20. Practical Approximation Ratio (W=11)

  21. Test Data • All-against-all comparison between mouse EST sequences and human EST sequences using the Smith-Waterman algorithm • 3,346,700 pairs found with a local alignment score of at least 16

  22. Performance of PH Seeds

  23. Performance of HMM Seeds

  24. 4 HMM Seeds vs. 1 HMM Seed

  25. Greedy vs. LP

  26. Summary and Future Work • The LP-based algorithm gives a mathematical foundation • The LP-based algorithm is also good in practice • The time complexity is exponential. Is there an approximation algorithm that does not enumerate seeds? • Can the greedy algorithm achieve a better approximation ratio?
