Multi-seed lossless filtration

Multi-seed lossless filtration. Gregory Kucherov Laurent Noé LORIA/INRIA, Nancy, France Mikhail Roytberg Institute of Mathematical Problems in Biology, Puschino, Russia CPM ( Istanbul ) July 5-7, 2004. potential matches. Text filtration: general principle.

Presentation Transcript


  1. Multi-seed lossless filtration. Gregory Kucherov, Laurent Noé (LORIA/INRIA, Nancy, France); Mikhail Roytberg (Institute of Mathematical Problems in Biology, Puschino, Russia). CPM (Istanbul), July 5-7, 2004

  2. Text filtration: general principle. [Figure: potential matches]

  4. Text filtration: general principle. Lossless and lossy filters. [Figure: true match]

  5. Filtration applied to local similarity search. [Figure: potential similarities]

  7. Filtration applied to local similarity search. [Figure: true similarities]

  8. Gapless similarities. Hamming distance. • Similarities are defined through Hamming distance GCTACGACTTCGAGCTGC ...CTCAGCTATGACCTCGAGCGGCCTATCTA...

  9. Gapless similarities. Hamming distance. • Similarities are defined through Hamming distance

  10. Gapless similarities. Hamming distance. • Similarities are defined through Hamming distance • (m,k)-problem, (m,k)-instances [Figure labels: m, k]

  11. Gapless similarities. Hamming distance. • Similarities are defined through Hamming distance • (m,k)-problem, (m,k)-instances • This work: lossless filtering [Figure labels: m, k]
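
As a small illustration of the definitions above, the sketch below computes Hamming distances and lists the windows of a text that form an (m,k)-instance with a pattern, assuming the usual reading that an (m,k)-instance is a pair of length-m strings differing in at most k positions. The function names are hypothetical; the two sequences are the ones shown on slide 8.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def instances(pattern, text, k):
    """Return (offset, distance) for every window of text within k mismatches of pattern."""
    m = len(pattern)
    found = []
    for pos in range(len(text) - m + 1):
        d = hamming(pattern, text[pos:pos + m])
        if d <= k:
            found.append((pos, d))
    return found


# Sequences taken from slide 8 (m = 18).
PATTERN = "GCTACGACTTCGAGCTGC"
TEXT = "CTCAGCTATGACCTCGAGCGGCCTATCTA"
print(instances(PATTERN, TEXT, 3))
```

With k = 3 the window starting at offset 4 of the text is reported, since it differs from the pattern in exactly three positions.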

  12. Lossless filtering by contiguous fragment • PEX (Navarro & Raffinot 2002) • Searching for a contiguous pattern • PEX with errors • Searching for a contiguous pattern with l possible errors • Efficient only for small alphabets and small l [Figure: (m,k) with m=18, k=3; contiguous seeds #### and #########(1)]
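
The contiguous-fragment idea can be sketched as follows, restricted to substitution errors as in the rest of the talk: if a window matches the pattern with at most k mismatches, then at least one of k+1 disjoint pieces of the pattern occurs in it exactly. This is a minimal sketch of that pigeonhole filter, not the PEX implementation of Navarro & Raffinot; `pex_search` is a hypothetical name and the code assumes the pattern has at least k+1 characters.

```python
def pex_search(pattern, text, k):
    """Pigeonhole filter: exact search for k+1 pieces, then Hamming verification."""
    m = len(pattern)
    pieces = k + 1
    # Cut points that split the pattern into k+1 nearly equal pieces.
    bounds = [i * m // pieces for i in range(pieces + 1)]
    hits = set()
    for a, b in zip(bounds, bounds[1:]):
        piece = pattern[a:b]
        start = text.find(piece)
        while start != -1:
            pos = start - a  # window start implied by this piece occurrence
            if 0 <= pos <= len(text) - m:
                window = text[pos:pos + m]
                if sum(x != y for x, y in zip(pattern, window)) <= k:
                    hits.add(pos)
            start = text.find(piece, start + 1)
    return sorted(hits)
```

For m = 18 and k = 3 the pieces have length 4 or 5, which corresponds to the contiguous seed #### shown on the slide.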

  13. Superposition of two filters (Pevzner & Waterman 1995). Idea: combine PEX with another filter based on a regularly spaced seed • PEX: #### • spaced PEX (matches occurring at every k positions): #---#---#---# [Figure: four shifted copies of #---#---#---#, labelled k+1]

  14. Spaced seeds • Spaced seeds (spaced q-grams) • proposed by Burkhardt & Kärkkäinen (CPM 2001) for solving (m,k)-problems • Principle • Searching for spaced rather than contiguous patterns • Selectivity • defined by the weight of the seed (number of #'s) • Example: ###-##

  15. Example: (18,3)-problem [Figure: six copies of the seed ###-##]
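
Whether a seed or a seed family solves a small (m,k)-problem can be checked by brute force: enumerate every placement of k mismatches among m positions and look for a seed occurrence that avoids all of them. The sketch below does exactly that; it is only an illustration for small m and k, and the name `solves` is hypothetical. Checking exactly k mismatches is enough, because removing a mismatch can only create more seed occurrences.

```python
from itertools import combinations


def solves(family, m, k):
    """True if every placement of k mismatches in m positions leaves some seed occurrence."""
    shapes = [(len(s), [i for i, c in enumerate(s) if c == "#"]) for s in family]
    for errs in combinations(range(m), k):
        bad = set(errs)
        detected = any(
            all(p + o not in bad for o in shape)
            for span, shape in shapes
            for p in range(m - span + 1)
        )
        if not detected:
            return False  # this mismatch pattern escapes every seed
    return True


print(solves(["###-##"], 18, 3))
print(solves(["#####"], 18, 3))
```

The same call accepts a list of several seeds, so it can also be used to check a family such as the two weight-7 seeds shown a few slides later.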

  16. Spaced seeds for sequence comparison • Ma, Tromp, Li 2002 (PatternHunter) • Estimating seed sensitivity: Keich et al 2002, Buhler et al 2003, Brejova et al 2003, Choi&Zhang 2004, Choi et al 2004, Kucherov et al 2004, ... • Extended seed models: BLASTZ 2003, Brejova et al 2003, Chen&Sung 2003, Noé&Kucherov 2004, ...

  17. Families of spaced seeds. This work: lossless filtration using spaced seed families (extension of Burkhardt & Kärkkäinen 2001) • single filter based on several distinct seeds • each seed detects a part of (m,k)-instances but together they must detect all (m,k)-instances. Independent work (lossy seed families for sequence alignment): • Li, Ma, Kisman, Tromp 2004 (PatternHunter II) • Xu, Brown, Li, Ma, this conference • Sun, Buhler, RECOMB 2004 (Mandala)

  18. Example: (18,3)-problem (cont). Family F = {##-#-####, ###---#--##-#} solves the (18,3)-problem • every (18,3)-instance contains an occurrence of a seed of F • all seeds of the family have the same weight 7

  19. Example: (18,3)-problem (cont) • w=7: ##-#-####, ###---#--##-# • w=9: ##-##-#####, ###-####--##, ###-##---#-###, ##----####-###, ###---#-#-##-##, ###-#-#-#-----### [Figure: occurrences of ###---#--##-# (w=7) and ###-##---#-### (w=9)]

  20. Comparative selectivity. Selectivity of families on Bernoulli similarities (p(match) = 1/4), estimated as the probability for one of the seeds to occur at a given position • w=4: ~$39 \cdot 10^{-4}$ (####) • w=5: ~$9.8 \cdot 10^{-4}$ (###-##) • w=7: ~$1.2 \cdot 10^{-4}$ (##-#-####, ###---#--##-#) • w=9: ~$0.23 \cdot 10^{-4}$ (##-##-#####, ###-####--##, ###-##---#-###, ##----####-###, ###---#-#-##-##, ###-#-#-#-----###)
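
The selectivity figures on this slide are consistent with a simple estimate: each seed of weight w occurs at a given position with probability p^w, and the probabilities are summed over the family. The sketch below uses that sum, which is an assumption on my part (a union bound that ignores overlaps between seeds) rather than a formula stated on the slide; `selectivity` is a hypothetical name.

```python
def selectivity(family, p_match=0.25):
    """Approximate probability that some seed of the family occurs at a given position."""
    # Each '#' must fall on a match; jokers ('-') impose no constraint.
    return sum(p_match ** seed.count("#") for seed in family)


families = {
    4: ["####"],
    5: ["###-##"],
    7: ["##-#-####", "###---#--##-#"],
    9: ["##-##-#####", "###-####--##", "###-##---#-###",
        "##----####-###", "###---#-#-##-##", "###-#-#-#-----###"],
}
for w, fam in families.items():
    print(w, selectivity(fam))
```

Under this estimate a family of six seeds of weight 9 is still several times more selective than the two-seed family of weight 7, which is the trade-off the slide illustrates.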

  21. How far should we go? • A trivial extreme solution • … would be to pick all seeds of weight m – k • selectivity 100% (no false positives) • prohibitive cost except for very small problems • We are interested in intermediate solutions: • relatively small number of seeds (< 10) to keep the hash table of a reasonable size, • the seed weight sufficiently large to obtain a good selectivity

  22. Results • Computing properties of seed families • Seed design • Seed expansion/contraction • Periodic seeds • Seed optimality • Heuristic seed design • Experiments • Examples of designed seed families • Application to computing specific oligonucleotides

  23. Measuring the efficiency of a family • Burkhardt & Kärkkäinen: optimal threshold of a seed: the minimal number of seed occurrences over all (m,k)-instances • A seed family F is lossless iff the optimal threshold $T_F(m,k) \geq 1$ • $T_F(m,k)$ can be computed by a dynamic programming algorithm in time $O(m \cdot k \cdot 2^{S+1})$ and space $O(k \cdot 2^{S+1})$, where S is the maximal length of a seed from F • optimizations are possible (see the paper) • the resulting space and time complexity is the same as in the Burkhardt & Kärkkäinen algorithm
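
A dynamic program of this kind can be sketched as follows. The state is the number of mismatches used so far together with the last S-1 symbols of the similarity string, and the stored value is the smallest number of seed occurrences seen on any prefix reaching that state. This is a plain illustration under those assumptions, without the optimizations mentioned on the slide, and `optimal_threshold` is a hypothetical name.

```python
def optimal_threshold(family, m, k):
    """Minimal number of seed occurrences over all similarity strings of length m
    with at most k mismatches ('1' = match, '0' = mismatch)."""
    spans = [len(s) for s in family]
    S = max(spans)
    # For each seed, offsets of its '#' positions counted back from its last position.
    back = [[len(s) - 1 - i for i, c in enumerate(s) if c == "#"] for s in family]
    states = {(0, ""): 0}  # (mismatches used, last <= S-1 symbols) -> min occurrences
    for pos in range(m):
        nxt = {}
        for (errs, suffix), hits in states.items():
            for bit in "10":
                e = errs + (bit == "0")
                if e > k:
                    continue
                window = suffix + bit  # last <= S symbols, ending at pos
                h = hits + sum(
                    1
                    for span, offs in zip(spans, back)
                    if span <= pos + 1 and all(window[-1 - o] == "1" for o in offs)
                )
                key = (e, window[1:] if len(window) == S else window)
                if h < nxt.get(key, float("inf")):
                    nxt[key] = h
        states = nxt
    return min(states.values())


print(optimal_threshold(["####"], 18, 3))
```

For the contiguous seed #### on the (18,3)-problem this gives 3, which agrees with the classical q-gram bound m - q + 1 - kq; a family is lossless exactly when the returned value is at least 1.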

  24. Measuring the efficiency of a family (cont). Using a similar DP technique we can compute, within the same time complexity bound: • the number $U_F(m,k)$ of undetected (m,k)-similarities for a (lossy) family F • the contribution of a seed of F, i.e. the number of (m,k)-similarities detected exclusively by this seed [see the paper for details]
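
For small problems both quantities can be obtained by plain enumeration instead of dynamic programming. The sketch below swaps in that brute-force technique and counts similarities as placements of exactly k mismatches, which is an assumption about how the similarities are counted; the function names are hypothetical.

```python
from itertools import combinations


def hits(seed, bad, m):
    """True if the seed has an occurrence in m positions that avoids every mismatch in bad."""
    shape = [i for i, c in enumerate(seed) if c == "#"]
    return any(all(p + o not in bad for o in shape) for p in range(m - len(seed) + 1))


def undetected_and_contributions(family, m, k):
    """Return (number of undetected similarities, per-seed exclusive detections)."""
    undetected = 0
    contrib = [0] * len(family)
    for errs in combinations(range(m), k):
        bad = set(errs)
        who = [i for i, seed in enumerate(family) if hits(seed, bad, m)]
        if not who:
            undetected += 1
        elif len(who) == 1:
            contrib[who[0]] += 1  # detected by this seed and by no other
    return undetected, contrib


print(undetected_and_contributions(["####", "#####"], 18, 3))
```

In this example ##### has contribution 0 because every similarity it detects is also detected by ####, so it could be dropped from the family without losing anything.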

  25. Design of seed families. Pruning exhaustive search tree (Burkhardt & Kärkkäinen) • Construct all solutions of weight w from solutions of weight w – 1 • Example: if ##--#--# and ##-#---# are solutions of weight w – 1, consider their «union» ##-##--# of weight w. • Prohibitive cost: • more than a week for computing all single-seed solutions of the (50,5)-problem • the search space blows up for multi-seed families

  26. Seed expansion/contraction. Burkhardt & Kärkkäinen: the only two solutions of weight 12 solving the (50,5)-problem: ###-#--###-#--###-# and #-#-#---#-----#-#-#---#-----#-#-#---#

  27. Seed expansion/contraction. Burkhardt & Kärkkäinen: the only two solutions of weight 12 solving the (50,5)-problem: ###-#--###-#--###-# (the only solution of weight 12 of the (25,2)-problem) and #-#-#---#-----#-#-#---#-----#-#-#---#

  28. Seed expansion/contraction. Burkhardt & Kärkkäinen: the only two solutions of weight 12 solving the (50,5)-problem: ###-#--###-#--###-# (the only solution of weight 12 of the (25,2)-problem) and #-#-#---#-----#-#-#---#-----#-#-#---# • Let $i \otimes F$ be the i-regular expansion of F, obtained by inserting i-1 jokers between successive positions of each seed of F • Example: if F = {###-#, ##-##} then $2 \otimes F$ = {#-#-#---#, #-#---#-#} and $3 \otimes F$ = {#--#--#-----#, #--#-----#--#}
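
The expansion described on this slide is a purely textual operation on the seed, so it can be sketched in a couple of lines. The names `expand` and `contract` are hypothetical, and `contract` assumes its argument really is an i-regular expansion of some seed.

```python
def expand(seed, i):
    """i-regular expansion: insert i-1 jokers between successive positions of the seed."""
    return ("-" * (i - 1)).join(seed)


def contract(seed, i):
    """Inverse of expand: keep every i-th position (assumes seed is an i-expansion)."""
    return seed[::i]


for s in ["###-#", "##-##"]:
    print(s, expand(s, 2), expand(s, 3))
```

Applied to ###-#--###-#--###-# with i = 2, the expansion gives the second weight-12 seed listed on the slide.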

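  29. Seed expansion/contraction (cont). Lemma: • If a family F solves an (m,k)-problem, then both F and its i-regular expansion $i \otimes F$ solve the $(i \cdot m, (k+1) \cdot i - 1)$-problem • If a family $i \otimes F$ solves the $(i \cdot m, k)$-problem, then its i-contraction F solves the $(m, \lfloor k/i \rfloor)$-problem. Example: F = {##-#-####, ###---#--##-#} solves the (18,3)-problem; F and $2 \otimes F$ = {#-#---#---#-#-#-#, #-#-#-------#-----#-#-#} solve the (36,7)-problem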

  30. Periodic seeds. Iterating short seeds with good properties into longer seeds: ###-#-- gives ###-#--###-#--###-#

  31. Cyclic problem. Cyclic (11,3)-problem: ###-#--#---. Linear (29,3)-problem: ###-#--#---###-#--#. Lemma: If a seed Q solves a cyclic (m,k)-problem, then the seed $Q_i = [Q, -^{m-s(Q)}]^i \cdot Q$ solves the linear $(m \cdot (i+1) + s(Q) - 1, k)$-problem, where s(Q) is the span of Q.
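
The cyclic problem and the iteration of a seed can both be sketched briefly. I read the construction as i copies of Q, each padded with jokers to length m, followed by one final copy of Q, which is what the (11,3) to (29,3) example on the slide suggests; the function names are hypothetical and the cyclic check is brute force, suitable only for small m.

```python
from itertools import combinations


def solves_cyclic(seed, m, k):
    """True if, on a cycle of length m, every placement of k mismatches
    leaves some rotation of the seed free of mismatches."""
    shape = [i for i, c in enumerate(seed) if c == "#"]
    for errs in combinations(range(m), k):
        bad = set(errs)
        if not any(all((p + o) % m not in bad for o in shape) for p in range(m)):
            return False
    return True


def iterate(seed, m, i):
    """Return the iterated seed and the length of the linear problem from the lemma."""
    padded = seed + "-" * (m - len(seed))  # Q followed by m - s(Q) jokers
    return padded * i + seed, m * (i + 1) + len(seed) - 1


print(solves_cyclic("###-#--#", 11, 3))
print(iterate("###-#--#", 11, 1))
```

The same construction applied to ###-# with m = 7 and i = 2 yields ###-#--###-#--###-# for a linear problem of length 25, matching the periodic seed shown on the previous slide.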

  32. Extension to multi-seed case. Cyclic (11,3)-problem: ###-#--#---. Linear (25,3)-problem: ###-#--#---###-#--# and #--#---###-#--#---###

  34. Asymptotic optimality. Theorem: Fix a number of errors k. Let w(m) be the maximal weight of a seed solving the linear (m,k)-problem. Then • the fraction of jokers tends to 0, but the convergence speed depends on k • seed expansion cannot provide an asymptotically optimal solution

  35. Non-asymptotic optimality • Fix a number of errors k. • For each seed (seed family) Q there exists $m_Q$ such that for all $m \geq m_Q$, Q solves the (m,k)-problem • For a class of seeds $\mathcal{C}$, Q is an optimal seed in $\mathcal{C}$ iff Q realizes the minimal $m_Q$ over all seeds of $\mathcal{C}$. Lemma: Let n be an integer and r = n/3. For every $k \geq 2$, the seed $\#^{n-r}\text{-}\#^{r}$ is optimal among seeds of weight n with one joker. Example: ####-## is optimal among the seeds of weight 6 with 1 joker: it solves all (m,2)-problems for m≥16, all (m,3)-problems for m≥20
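
For a single seed and small k, the threshold length $m_Q$ can be found by trying increasing values of m with a brute-force check. This is only a sketch for small cases; `solves` and `min_length` are hypothetical names, and the search relies on the fact that a seed solving the (m,k)-problem also solves the (m+1,k)-problem.

```python
from itertools import combinations


def solves(seed, m, k):
    """True if every placement of k mismatches in m positions leaves a seed occurrence."""
    shape = [i for i, c in enumerate(seed) if c == "#"]
    for errs in combinations(range(m), k):
        bad = set(errs)
        if not any(all(p + o not in bad for o in shape)
                   for p in range(m - len(seed) + 1)):
            return False
    return True


def min_length(seed, k, m_max=64):
    """Smallest m <= m_max such that the seed solves the (m,k)-problem, or None."""
    for m in range(len(seed), m_max + 1):
        if solves(seed, m, k):
            return m
    return None


print(min_length("####-##", 2))
```

For ####-## and k = 2 the search stops at m = 16, in line with the example on the slide; mismatches at positions 6 and 8 of a 15-long similarity are one placement that the seed misses.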

  36. Heuristic seed design: genetic algorithm • a population of seed families evolves by mutation and crossover • seed families are screened against sets of difficult (m,k)-instances • for a family that detects all difficult instances, the number of undetected similarities is computed by a DP algorithm. A family is kept if it yields a smaller number than the currently known families do • compute the contribution of each seed of the family and mutate the least “valuable” seeds. [Figure labels: difficult (m,k)-instances; select; select and reorder seed families]
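
The loop above can be illustrated with a much simpler search. The sketch below is not the authors' genetic algorithm: it keeps a single family, uses mutation only (no crossover, no screening against difficult instances), and counts undetected similarities by brute force rather than by DP. All names and parameters are hypothetical.

```python
import random
from itertools import combinations


def count_undetected(family, m, k):
    """Number of placements of k mismatches in m positions that no seed detects."""
    shapes = [(len(s), [i for i, c in enumerate(s) if c == "#"]) for s in family]
    missed = 0
    for errs in combinations(range(m), k):
        bad = set(errs)
        if not any(all(p + o not in bad for o in shape)
                   for span, shape in shapes for p in range(m - span + 1)):
            missed += 1
    return missed


def mutate(seed, rng):
    """Swap one inner '#' with one inner joker, keeping weight, span and end positions."""
    inner = range(1, len(seed) - 1)
    hashes = [i for i in inner if seed[i] == "#"]
    jokers = [i for i in inner if seed[i] == "-"]
    if not hashes or not jokers:
        return seed
    i, j = rng.choice(hashes), rng.choice(jokers)
    chars = list(seed)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)


def hill_climb(family, m, k, steps=200, rng_seed=0):
    """Mutate one seed at a time; keep the change if it does not increase the miss count."""
    rng = random.Random(rng_seed)
    best, best_cost = list(family), count_undetected(family, m, k)
    for _ in range(steps):
        if best_cost == 0:
            break  # the family is already lossless
        cand = list(best)
        idx = rng.randrange(len(cand))
        cand[idx] = mutate(cand[idx], rng)
        cost = count_undetected(cand, m, k)
        if cost <= best_cost:
            best, best_cost = cand, cost
    return best, best_cost
```

A call such as `hill_climb(["##-#-#"], 12, 2)` returns a family of the same weight and span whose number of undetected similarities is no larger than that of the starting family.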

  37. Example: (25,2)-problem

  38. Application of lossless filtering: oligo design • Specific oligonucleotides: small DNA molecules (10-50bp) that hybridize with a target sequence and do not hybridize with background sequences (e.g. the rest of the genome) • Formalization: given a sequence (or database), find all windows of length m which do not occur elsewhere within k substitution errors
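
The formalization above can be sketched as a seed-based filter followed by verification: windows are indexed by the characters under the '#' positions of each seed, and only pairs of positions that share a key are verified by Hamming distance. This is a minimal sketch, assuming the seed family passed in is lossless for the chosen (m,k); the function names are hypothetical and no attempt is made at the efficiency needed for a real database.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatches between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def non_specific_windows(text, m, k, family):
    """Start positions of length-m windows that occur elsewhere within k substitutions."""
    n = len(text)
    found = set()
    for seed in family:
        shape = [i for i, c in enumerate(seed) if c == "#"]
        buckets = defaultdict(list)
        for p in range(n - len(seed) + 1):
            buckets["".join(text[p + o] for o in shape)].append(p)
        for positions in buckets.values():
            for a in range(len(positions)):
                for b in range(a + 1, len(positions)):
                    p, q = positions[a], positions[b]  # p < q, same seed key
                    # The seed may sit at any offset d inside the two windows.
                    for d in range(m - len(seed) + 1):
                        i, j = p - d, q - d
                        if i < 0 or j + m > n:
                            continue
                        if hamming(text[i:i + m], text[j:j + m]) <= k:
                            found.update((i, j))
    return found
```

Windows whose start position is absent from the returned set are the candidates for specific oligonucleotides; with `family=["###-##"]`, `m=18` and `k=3` the filter is lossless, so the result coincides with an all-pairs comparison.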

  39. Seed design: (32,5)-problem

  40. Experiment • This filter has been applied to the rice EST database (100015 sequences of total size ~42 Mbp) • All 32-windows occurring elsewhere within 5 errors have been computed • The computation took slightly more than 1 hour on a P4 3GHz computer • 87% of the database has been “filtered out”

  41. Further questions • Combinatorial structure of optimal seed families • Efficient design algorithm

  42. Questions?

  43. Conclusions • Filtering method for approximate pattern matching • Based on the design and use of a family of spaced seeds. • Selective in practice, but the seed design requires a computational effort. • Possible extensions • Consider spaced seeds that allow one error. • Open problems • An efficient algorithm for designing the optimal seed family?

  44. References [1] S. Burkhardt and J. Kärkkäinen, Better Filtering with Gapped q-Grams, Fundamenta Informaticae, 23:1001-1018, 2003. [2] P. Pevzner and M. Waterman, Multiple Filtration and Approximate Pattern Matching, Algorithmica 13(1/2):135-154, 1995. [3] J. SantaLucia, A unified view of polymer and oligonucleotide DNA nearest-neighbor thermodynamics, Biochemistry 95:1460-1465, 1998. [4] G. Navarro and M. Raffinot, Flexible Pattern Matching in Strings: Practical on-line search algorithms for texts, Cambridge University Press, 2002. [5] …

  45. Problem statement • Biological problem. Oligonucleotide: a DNA fragment of fixed size that hybridizes only with a given region of a target sequence. Find the oligonucleotides specific to a sequence. • Oligo design: DNA microarrays • Primer design: PCR

  46. Problem statement. Specificity • Given: • a target sequence S • a background sequence B • Find a motif of size m that hybridizes with a region of S and with no region of B

  47. Problem statement • How to define a specific oligonucleotide? • It is a DNA fragment M of fixed size m. • It must be specific: • hybridize with a region of a target sequence S (exact match) • be far from every fragment of a background sequence B.

  48. Example • On the (m=18, k=3) problem [Figure: six copies of the seed ###.##]
