
Parameter Tuning for Differential Mining of String Patterns

Parameter Tuning for Differential Mining of String Patterns. J. Besson, C. Rigotti, I. Mitasiunaite and J.-F. Boulicaut. Tuning extraction parameters. Local pattern mining: itemsets, closed itemsets, episodes, sequential patterns, substrings


Presentation Transcript


  1. Parameter Tuning for Differential Mining of String Patterns
     J. Besson, C. Rigotti, I. Mitasiunaite and J.-F. Boulicaut
     DDDM'08, Pisa - 15/12/2008

  2. Tuning extraction parameters
     • Local pattern mining: itemsets, closed itemsets, episodes, sequential patterns, substrings
     • … under constraints (monotonic, anti-monotonic, or neither; pattern shapes, occurrence properties, measures, …)
     • can select/focus …
     • … but where to look in the parameter space?
     • often easy with a single threshold
     • … but what about multiple constraints or multiple thresholds?

  3. Two different kinds of tuning
     • 1) exploratory stage: find promising areas in the parameter space
     • 2) fine-grained tuning: a kind of greedy strategy based on small local explorations of the parameter space

  4. Tools?
     • The best tool ever used in the exploratory stage to find promising parameter settings in local pattern mining? …

  5. Tools
     • GREP + Word Count
     • method: a manual mix
     • count extracted patterns
     • choose points in the parameter space
     • random walk
     • try a local greedy strategy
     • keeping in mind known properties of the constraints (when applicable) and domain knowledge

  6. Tools
     • … with several parameters and several thresholds, e.g., minimal support on one dataset and maximal support on another dataset …
     • perform a more exhaustive exploration of the pattern space
     • draw curves depicting the extraction landscape

  7. Tools / landscape
     • Examples

  8. Obtaining extraction landscapes
     • use a script: can need a lot of resources to execute; too much time is needed to explore a large parameter space (several parameters)
     • use a global model of the presence of the local patterns to estimate the number of patterns
     • reuse/adapt a model: few exist
     • develop a new global model: each kind of pattern and each conjunction of constraints can be a research problem in itself
     • incorporate domain knowledge? A global analytical model is even more complex to exhibit …

  9. What about sampling the pattern space?
     • sounds too naive, needing complicated frameworks
     • how to sample?
     • size of the sample?
     • number of patterns in the sample that satisfy the constraints?
     • using domain knowledge?
     • how to estimate the value for the whole pattern space?

  10. What about simple choices?
      • sampling with replacement among the patterns that satisfy the syntactic constraints (of the conjunction of constraints)
      • number of patterns in the sample that satisfy the constraints
      • compute the probability of satisfying the constraints for each pattern in the sample (incorporating domain knowledge)
      • approximate number of patterns that satisfy the constraints (in the sample)
      • sample size: grow the sample until the percentage of patterns satisfying the constraints converges
      • estimate the number of patterns in the pattern space that satisfy the constraints: apply this percentage to the number of patterns that satisfy the syntactic constraints

  11. Whole process
      • 1) build an initial sample of Psynt (the patterns that satisfy the syntactic constraints)
      • 2) compute an estimate of E(N), the expected number of patterns satisfying the constraints, from the sample
      • 3) add more patterns to the sample
      • 4) compute an estimate of E(N) from the sample
      • 5) if the estimate changes a lot, go to 3)
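A minimal Python sketch of this loop, for illustration only. The function name, its parameters (`sample_pattern`, `prob_satisfies`, `psynt_size`, `initial_size`, `batch_size`, `rel_tol`, `max_size`) and the relative-change stopping rule are assumptions of this sketch and are not given on the slides; the 5% default echoes the value that slide 21 raises as a question.

```python
import random
from typing import Callable


def estimate_pattern_count(
    sample_pattern: Callable[[], str],       # draws one pattern from Psynt (with replacement)
    prob_satisfies: Callable[[str], float],  # E(Xp): probability that p satisfies the constraints
    psynt_size: int,                         # number of patterns in Psynt
    initial_size: int = 1000,                # hypothetical default
    batch_size: int = 1000,                  # hypothetical default
    rel_tol: float = 0.05,                   # "changes a lot" = relative change above 5% (assumed)
    max_size: int = 1_000_000,               # safety cap, not on the slides
) -> float:
    """Estimate E(N) by growing a sample of Psynt until the estimate stabilises."""
    # 1) build an initial sample, 2) compute a first estimate
    total = sum(prob_satisfies(sample_pattern()) for _ in range(initial_size))
    n = initial_size
    estimate = psynt_size * total / n
    while n < max_size:
        # 3) add more patterns to the sample
        total += sum(prob_satisfies(sample_pattern()) for _ in range(batch_size))
        n += batch_size
        # 4) recompute the estimate
        new_estimate = psynt_size * total / n
        changed = abs(new_estimate - estimate) > rel_tol * max(abs(estimate), 1e-12)
        estimate = new_estimate
        # 5) stop once the estimate no longer changes a lot
        if not changed:
            break
    return estimate


if __name__ == "__main__":
    alphabet, size = "ACGT", 8
    draw = lambda: "".join(random.choices(alphabet, k=size))
    # Toy probability model, purely illustrative: every pattern is equally likely to qualify.
    print(estimate_pattern_count(draw, lambda p: 0.25, psynt_size=len(alphabet) ** size))
```

Keeping a running total means each new batch costs only `batch_size` probability evaluations, so the sample can grow without recomputing earlier terms.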

  12. Using it in frequent substring mining
      • Two datasets: R1 and R2 (two sets of strings)
      • Constraints
      • having size Z
      • appearing at least min times in R1
      • appearing no more than max times in R2
      • Consider exact and approximate matching
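As an illustration, the three constraints can be checked on concrete data as sketched below for exact matching. This assumes that "appearing k times" means "occurring in k strings of the dataset", which is how the support constraints are treated on slide 13; the function names and the toy strings are hypothetical.

```python
from typing import Sequence


def support(pattern: str, strings: Sequence[str]) -> int:
    """Number of strings that contain at least one exact occurrence of the pattern."""
    return sum(1 for s in strings if pattern in s)


def satisfies(pattern: str, r1: Sequence[str], r2: Sequence[str],
              size: int, min_sup: int, max_sup: int) -> bool:
    """Check size Z, minimal support in R1 and maximal support in R2."""
    return (len(pattern) == size
            and support(pattern, r1) >= min_sup
            and support(pattern, r2) <= max_sup)


if __name__ == "__main__":
    r1 = ["ACGTAC", "TTACGA", "GGGACG"]   # toy data, not from the slides
    r2 = ["TTTTTT", "ACGACG"]
    print(support("ACG", r1), support("ACG", r2))
    print(satisfies("ACG", r1, r2, size=3, min_sup=2, max_sup=1))
```

Approximate matching would need a different `support` function (for example one based on a bounded number of mismatches); it is left out here because the slides do not specify the matching criterion.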

  13. Pattern space and domain knowledge
      • strings over an alphabet of 4 or 8 symbols
      • domain knowledge as three models of symbol distribution
      • Me: independent symbols with equal frequencies
      • Md: independent symbols with different frequencies
      • Mm: first-order Markov model
      • for a given pattern p, and Me or Md or Mm, we have the probability that there exists at least one occurrence of p in a string
      • from the binomial distribution we have the probability that p satisfies the min and max support constraints
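A hedged Python sketch of the binomial step. It assumes that the strings of a dataset are independent and that R1 and R2 are independent of each other, so the two support probabilities multiply. The per-string occurrence probability is approximated here by treating start positions as independent trials; that shortcut ignores overlapping occurrences and is not necessarily the formula used by the authors. All function names are hypothetical.

```python
from math import comb, prod
from typing import Mapping


def approx_occurrence_prob(pattern: str, freqs: Mapping[str, float], length: int) -> float:
    """Rough P(at least one exact occurrence in a random string) under Md (or Me)."""
    q = prod(freqs[c] for c in pattern)             # P(match at one fixed position)
    positions = max(length - len(pattern) + 1, 0)   # number of possible start positions
    return 1.0 - (1.0 - q) ** positions             # approximation: positions independent


def binom_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k of n independent strings contain the pattern)."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def prob_satisfies(p1: float, n1: int, min_sup: int,
                   p2: float, n2: int, max_sup: int) -> float:
    """P(support in R1 >= min_sup) * P(support in R2 <= max_sup)."""
    at_least = sum(binom_pmf(k, n1, p1) for k in range(max(min_sup, 0), n1 + 1))
    at_most = sum(binom_pmf(k, n2, p2) for k in range(0, min(max_sup, n2) + 1))
    return at_least * at_most


if __name__ == "__main__":
    # Frequencies as on slide 14; the symbol names a, b, c, d are placeholders.
    freqs = {"a": 0.4, "b": 0.1, "c": 0.2, "d": 0.3}
    p = approx_occurrence_prob("abca", freqs, length=1000)
    # Thresholds below are illustrative, not taken from the slides.
    print(prob_satisfies(p, 100, 60, p, 100, 90))
```

Under Mm the per-position match probability would come from the transition probabilities instead of a plain product of frequencies; only `approx_occurrence_prob` would change.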

  14. Example / random data
      • 4 symbols, Md (0.4, 0.1, 0.2, 0.3), 100 strings of length 1000 in R1 and R2, exact match

  15. Example / random data
      • 4 symbols, Mm, 100 strings of length 1000 in R1 and R2, exact and approximate match

  16. Example / gene promoter sequences
      • 4 symbols (A, C, G, T), Md, strings of 4000 symbols, 29 in R1 and 21 in R2, approximate match

  17. Example / gene promoter sequences
      • Estimate vs. extraction

  18. Conclusion
      • Drawing extraction landscapes for parameter tuning in local pattern extraction, using pattern space sampling …
      • seems possible …
      • … at least in some cases
      • … using a simple framework
      • … incorporating domain knowledge (to some extent; there are many works on the probability that a given pattern satisfies constraints)
      • simpler than building a global analytical model
      • faster than running real extractions
      • … sufficient in the exploratory stage?
      • … companion software?

  19. Example / random data
      • 8 symbols, Me, 100 strings of length 30000 in R1 and R2, approximate match

  20. Problem: sampling / estimate
      • kind of sampling (with replacement?)
      • specific sampling (a kind of stratified sampling) for some constraints?
      • kinds of patterns?
      • quality of the estimates … occurrences of different patterns are not independent

  21. Problem: other parameters added
      • size of the starting set
      • convergence criterion? 5%?
      • size of the additional subsets
      • … not so hard to tune?

  22. Number of patterns
      • conjunction of constraints C
      • patterns in the pattern space PS
      • for each pattern p, let the variable Xp = 1 if p satisfies C, and Xp = 0 otherwise
      • N = number of patterns that satisfy C = sum of Xp over PS
      • E(N) = sum of E(Xp) over PS
      • E(Xp) = probability that p satisfies C
      • Psynt = patterns in PS that satisfy the syntactic constraints in C
      • E(N) = sum of E(Xp) over Psynt

  23. Number of patterns
      • compute NS = sum of E(Xp) over a sample of Psynt
      • compute the ratio NR = NS / sample size
      • use NR * size of Psynt as an estimate of E(N)
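The reasoning of slides 22 and 23 can be written out as follows. The sample notation p_1, …, p_n and the assumption that the sample is drawn uniformly from Psynt are introduced here for illustration; the slides only state that sampling is done with replacement.

```latex
% Linearity of expectation over the indicator variables X_p
N = \sum_{p \in PS} X_p
\quad\Longrightarrow\quad
E(N) = \sum_{p \in PS} E(X_p), \qquad E(X_p) = P(p \text{ satisfies } C).

% A pattern violating the syntactic constraints cannot satisfy C, so E(X_p) = 0 outside P_{synt}
E(N) = \sum_{p \in P_{synt}} E(X_p).

% Sample-based estimate from p_1, \dots, p_n drawn from P_{synt}
NS = \sum_{i=1}^{n} E(X_{p_i}), \qquad
NR = \frac{NS}{n}, \qquad
E(N) \approx NR \cdot |P_{synt}|.
```

Under the uniform-sampling assumption, NR is the sample mean of E(Xp), so multiplying it by the size of Psynt rescales an average into a total.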

  24. Example / gene promoter sequences
      • Estimate vs. extraction

  25. Example / gene promoter sequences
      • Estimate vs. extraction

  26. Often repeat the exploratory stage
      • redo the exploratory stage after important changes such as:
      • data selection (e.g., part of the sequences)
      • encoding (e.g., mapping to event types)
      • discretization (e.g., binarization threshold)
      • …
