This paper discusses parameter tuning in local pattern mining, particularly focusing on the extraction of itemsets, closed itemsets, episodes, and substrings under various constraints. It outlines two tuning stages: exploratory and fine-grain tuning, utilizing tools like GREP and Word Count. The study emphasizes sampling techniques to estimate the number of patterns satisfying constraints, exploring extraction landscapes, and integrating domain knowledge for effective parameter setting. Results from experiments on real datasets, such as gene promoter sequences, showcase the practical applications of the proposed methods.
Parameter Tuning for Differential Mining of String Patterns
J. Besson, C. Rigotti, I. Mitasiunaite and J.-F. Boulicaut
DDDM'08, Pisa - 15/12/2008
Tuning extraction parameters
• Local pattern mining: itemsets, closed itemsets, episodes, sequential patterns, substrings
• … under constraints (monotonic, anti-monotonic, or neither; pattern shapes, occurrence properties, measures, …)
• can select/focus …
• … but where to look in the parameter space?
• often easy with a single threshold
• … but what about multiple constraints/multiple thresholds?
Two different kinds of tuning
• 1) exploratory stage: find promising areas of the parameter space
• 2) fine-grain tuning: a kind of greedy strategy based on small local explorations of the parameter space
Tools?
• Best tool ever used in the exploratory stage to find promising parameter settings in local pattern mining??? …
Tools
• GREP + Word Count
• method: manual mix
• count extracted patterns
• choose points in parameter space
• random walk
• try a local greedy strategy
• keeping in mind known properties of the constraints (when applicable) and domain knowledge
Tools
• … with several parameters and several thresholds, e.g., a minimal support on one dataset and a maximal support on another …
• perform a more exhaustive exploration of pattern space
• draw curves depicting the extraction landscape
Tools / landscape
• Examples
Obtaining extraction landscapes
• use a script: can need a lot of resources to execute, and too much time to explore a large parameter space (several parameters)
• use a global model of the presence of the local patterns to estimate the number of patterns
• reuse/adapt a model: not many exist
• develop a new global model: each kind of pattern and each conjunction of constraints can be a research problem in itself
• incorporate domain knowledge? A global analytical model is even more complex to derive …
What about sampling the pattern space?
• sounds either too naive or in need of complicated frameworks
• how to sample?
• size of the sample?
• number of patterns in the sample that satisfy the constraints?
• using domain knowledge?
• how to estimate the value for the whole pattern space?
What about simple choices?
• sampling: with replacement, among the patterns that satisfy the syntactic constraints (conjunction of constraints)
• number of patterns in the sample that satisfy the constraints:
• compute, for each pattern in the sample, the probability that it satisfies the constraints (incorporating domain knowledge)
• this gives the approximate number of patterns in the sample that satisfy the constraints
• sample size: grow the sample until the percentage of patterns satisfying the constraints converges
• estimate of the number of patterns in the pattern space that satisfy the constraints: apply this percentage to the number of patterns that satisfy the syntactic constraints
Whole process
• 1) build an initial sample of Psynt
• 2) compute an estimate of E(N) from the sample
• 3) add more patterns to the sample
• 4) recompute the estimate of E(N) from the sample
• 5) if the estimate changed a lot, go to 3)
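The five steps above can be sketched as a small loop. This is only an illustration under stated assumptions, not the authors' software: `sample_pattern` and `prob_satisfies` are hypothetical caller-supplied functions (a uniform draw from Psynt, and the probability E(Xp) under some domain model), the initial and batch sizes are arbitrary, and the 5% relative-change stopping rule is one possible reading of "changes a lot".

```python
import random

def estimate_expected_count(sample_pattern, prob_satisfies, space_size,
                            initial_size=1000, batch_size=1000,
                            tolerance=0.05, max_size=1_000_000, seed=0):
    """Estimate E(N) by growing a sample of Psynt until the estimate stabilises.

    sample_pattern(rng): draws one pattern uniformly from Psynt (with replacement).
    prob_satisfies(p):   returns E(Xp), the probability that p satisfies C.
    space_size:          the number of patterns in Psynt.
    All three are assumptions of this sketch and must be supplied by the caller.
    """
    rng = random.Random(seed)
    total = 0.0  # running sum of E(Xp) over the sample
    size = 0     # current sample size

    def grow(count):
        nonlocal total, size
        for _ in range(count):
            total += prob_satisfies(sample_pattern(rng))
        size += count
        return space_size * total / size  # (sum / sample size) * size of Psynt

    estimate = grow(initial_size)                        # steps 1) and 2)
    while size < max_size:
        previous, estimate = estimate, grow(batch_size)  # steps 3) and 4)
        # step 5): stop once the relative change is within the tolerance
        if abs(estimate - previous) <= tolerance * max(abs(previous), abs(estimate)):
            break
    return estimate, size
```

The function returns both the estimate and the final sample size, so the cost of reaching convergence can be inspected for each point of the parameter space.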
Using it in frequent substring mining
• Two datasets: R1 and R2 (two sets of strings)
• Constraints: having size Z; appearing at least min times in R1; appearing no more than max times in R2
• Consider exact and approximate matching
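To make the three constraints concrete, here is a brute-force toy check for exact matching only. It is a sketch, not the extractor used in the talk. It assumes that "appearing k times" means "occurring in k strings of the dataset" (one reading of the slide), and the function names are hypothetical.

```python
from itertools import product

def support(pattern, strings):
    """Number of strings that contain the pattern at least once (exact match)."""
    return sum(1 for s in strings if pattern in s)

def satisfies(pattern, r1, r2, min_support, max_support):
    """True if the pattern is frequent enough in R1 and rare enough in R2."""
    return (support(pattern, r1) >= min_support
            and support(pattern, r2) <= max_support)

def count_patterns(r1, r2, alphabet, size, min_support, max_support):
    """Count the size-Z patterns meeting both support constraints.

    Enumerates the whole pattern space, so it is only usable for tiny Z.
    """
    return sum(
        1
        for symbols in product(alphabet, repeat=size)
        if satisfies("".join(symbols), r1, r2, min_support, max_support)
    )
```

Calling `count_patterns` over a grid of (min, max) values gives the points of an extraction landscape for small cases, which is exactly the costly computation that the sampling approach tries to avoid.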
Pattern space and domain knowledge
• strings over an alphabet of 4 or 8 symbols
• domain knowledge as three models of symbol distribution
• Me: independent symbols with equal frequency
• Md: independent symbols with different frequencies
• Mm: first-order Markov model
• for a given pattern p, and Me or Md or Mm, we have the probability that there exists at least one occurrence of p in a string
• from the binomial distribution we have the probability that p satisfies the min and max support constraints
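The last two bullets can be illustrated for the simplest model, Me, with exact matching. This is a sketch under explicit simplifications, not the computation used in the talk: start positions are treated as independent (an approximation that ignores pattern self-overlap), all strings in a dataset have the same length, and R1 and R2 are independent. Function names are hypothetical.

```python
from math import comb

def prob_occurs_in_string(pattern_size, string_length, alphabet_size):
    """Approximate P(at least one exact occurrence of p in one string) under Me.

    Approximation: the start positions are treated as independent trials.
    """
    q = (1.0 / alphabet_size) ** pattern_size       # match at one position
    positions = max(string_length - pattern_size + 1, 0)
    return 1.0 - (1.0 - q) ** positions

def binom_at_least(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(max(k, 0), n + 1))

def binom_at_most(n, k, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(0, min(k, n) + 1))

def prob_satisfies(pattern_size, alphabet_size, n1, len1, n2, len2,
                   min_support, max_support):
    """P(support in R1 >= min and support in R2 <= max), R1 and R2 independent."""
    p1 = prob_occurs_in_string(pattern_size, len1, alphabet_size)
    p2 = prob_occurs_in_string(pattern_size, len2, alphabet_size)
    return (binom_at_least(n1, min_support, p1)
            * binom_at_most(n2, max_support, p2))
```

Under Me every pattern of a given size gets the same probability, so the model only becomes pattern-specific with Md or Mm, where the per-position match probability depends on the symbols of p.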
Example / random data
• 4 symbols, Md with frequencies (0.4, 0.1, 0.2, 0.3), 100 strings of length 1000 in R1 and R2, exact match
Example / random data
• 4 symbols, Mm, 100 strings of length 1000 in R1 and R2, exact and approximate match
Example / gene promoter sequences
• 4 symbols (A, C, G, T), Md, strings of 4000 symbols, 29 in R1 and 21 in R2, approximate match
Example / gene promoter sequences
• Estimate vs. extraction
Conclusion
• Drawing extraction landscapes for parameter tuning in local pattern extraction, using pattern space sampling …
• seems possible …
• … at least in some cases
• … using a simple framework
• … incorporating domain knowledge (to some extent: there are many works on the probability that a given pattern satisfies constraints)
• simpler than building a global analytical model
• faster than running real extractions
• … sufficient in the exploratory stage?
• … companion software?
Example / random data
• 8 symbols, Me, 100 strings of length 30000 in R1 and R2, approximate match
Problem: sampling / estimate
• kind of sampling (with replacement?)
• specific sampling (a kind of stratified sampling) for some constraints?
• kinds of patterns?
• quality of estimates … occurrences of different patterns are not independent
Problem: other parameters added
• size of the starting set
• convergence criterion? 5%?
• size of the additional subsets
• … not so hard to tune?
Number of patterns
• conjunction of constraints C
• patterns in pattern space PS
• for each pattern p, let the variable Xp = 1 if p satisfies C, and Xp = 0 otherwise
• N = number of patterns that satisfy C = sum of Xp over PS
• E(N) = sum of E(Xp) over PS
• E(Xp) = probability that p satisfies C
• Psynt = patterns in PS that satisfy the syntactic constraints in C
• E(N) = sum of E(Xp) over Psynt
Number of patterns
• compute NS = sum of E(Xp) over a sample of Psynt
• compute the ratio NR = NS / sample size
• use NR * size of Psynt as an estimate of E(N)
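Written out, the estimator reads as follows. The symbol S for the sample is introduced here for illustration, and the sample is assumed to be drawn uniformly with replacement from Psynt. Under that assumption the expected value of NR is the average of E(Xp) over Psynt, which is why multiplying by the size of Psynt recovers E(N).

```latex
\begin{align*}
E(N) &= \sum_{p \in PS} E(X_p) = \sum_{p \in P_{synt}} E(X_p)
  && \text{since } E(X_p) = 0 \text{ when } p \notin P_{synt} \\
N_S &= \sum_{p \in S} E(X_p), \qquad N_R = \frac{N_S}{|S|} \\
E(N) &\approx N_R \cdot |P_{synt}|
\end{align*}
```

For instance, if the only syntactic constraint were "size Z" over an alphabet of 4 symbols, the size of Psynt would be 4^Z, so no enumeration is needed to obtain it.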
Example / gene promoter sequences
• Estimate vs. extraction
Often repeat the exploratory stage
• redo the exploratory stage after important changes such as:
• data selection (e.g., part of the sequences)
• encoding (e.g., mapping on event types)
• discretization (e.g., threshold of binarization)
• …