Generating Semantic Annotations for Frequent Patterns Using Context Analysis
Topics Covered • Need. • Evolution. • What exactly are we looking for? • Terms and Definitions. • Modeling Context Patterns. • Semantic Analysis and pattern annotation. • Experiments and results. • Related Work. • Conclusions. • Acknowledgements
Need • Frequent pattern mining is a fundamental focus of the data mining task. • Yet little information is associated with the generated frequent pattern sets, so users cannot tell what a pattern means. • What we want is something analogous to the abstract of a paper: a concise summary of the pattern's semantics.
Evolution of Pattern Generation • Research has moved towards better presentation and interpretation of the discovered frequent patterns. • Use of concepts like closed frequent patterns and maximal frequent patterns to shrink the size of the frequent pattern set and provide information beyond just "support". • Use of other parameters to summarize frequent patterns, namely "transaction coverage" and "pattern profiles". • In spite of the added information, the user still cannot interpret the hidden semantic information of a frequent pattern, and still has to go through the entire dataset to check whether it is worth exploring.
What exactly is a Semantic Annotation? • We take a cue from natural language processing: a dictionary entry. • Think comparatively: what a dictionary provides for a word, a semantic annotation should provide for a frequent pattern.
Example • Example of a Dictionary
Example (cont'd) • Dictionary entry <=> Frequent pattern annotation: • Pronunciation, Definition <=> Context indicators. • Examples <=> Example transactions. • Synonyms and Thesaurus <=> Semantically similar patterns.
Example (cont’d) • What we are finally looking for
Problem Formulation • Consider the following: D => database, t => transaction, pα => pattern, PD => set of all patterns in D. • Hence we have: D = {t1, t2, t3, …, tn} and PD = {p1, p2, …, pi}. • Dα = {ti | pα ⊆ ti, ti ∈ D}, the set of transactions that contain pα.
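A minimal sketch of this formulation in Python; the toy dataset, items, and names are illustrative, not from the paper:

```python
# Illustrative only: transactions are sets of items, a pattern is a set of items.
D = [
    {"a", "b", "c"},
    {"a", "c", "d"},
    {"b", "c", "e"},
    {"a", "b", "c", "e"},
]  # D = {t1, ..., tn}

def d_alpha(p_alpha, D):
    """D_alpha = {t_i | p_alpha is a subset of t_i, t_i in D}."""
    return [t for t in D if p_alpha <= t]

p_alpha = {"a", "c"}
print(d_alpha(p_alpha, D))  # three transactions contain both 'a' and 'c'
```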
Terms and Definitions • Using this notation, we define the following terms: • Frequent Pattern. • Context Unit. • Pattern Context. • Semantic Annotation. • Context Modeling. • Transaction Extraction. • Semantically Similar Pattern (SSP). • Semantic Pattern Annotation (SPA).
Frequent Pattern: A pattern pα is frequent in a dataset D if |Dα|/|D| >= min_sup, where min_sup is a user-specified threshold; |Dα|/|D| is called the support of pα, usually denoted as s(pα). • Context Unit: Given a dataset D and the set of frequent patterns PD, a context unit is a basic object in D which contains some semantic information and co-occurs with at least one pα ∈ PD in at least one transaction. • Pattern Context: Given a dataset D and a frequent pattern pα, the context of pα is represented by a selected set of context units such that every one of them co-occurs with pα. Each such context unit is also called a context indicator. • Semantic Annotation: Let pα be a frequent pattern in a dataset D, U be the set of context indicators of pα and P be a set of patterns in D; then a semantic annotation of pα consists of: • A set of context indicators of pα. • A set of representative transactions. • A set of semantically similar patterns. • Context Modeling: Given a dataset D and a set of possible context units U, the problem of context modeling is to select a subset of U, define a strength measure w(·, pα) for context indicators, and construct a context model c(pα) for each given pattern pα.
Transaction Extraction: Given a dataset D, the problem of transaction extraction is to define a similarity measure between a transaction and a pattern context, and to extract a set of representative transactions for each frequent pattern. • Semantically Similar Pattern (SSP): Given a dataset D and a set of candidate patterns, the problem of SSP extraction is to define a similarity measure between the contexts of two patterns and to extract, for any frequent pattern, the top k patterns with the most similar contexts. • Semantic Pattern Annotation (SPA): The overall task consists of: • Selecting context units and defining a weight function for them. • Designing similarity measures. • Extracting significant context indicators, representative transactions and semantically similar patterns.
Challenges associated with Semantic Pattern Annotation (SPA) • We have no prior knowledge of how to construct a context model. • We have no clue how to select context units when the set of possible context units is huge. • It is not clear how to analyze pattern semantics, so the design of weighting functions and similarity measures is non-trivial. • Since no training data is available, the learning is totally unsupervised. • These challenges, however, give SPA its flexibility: it does not depend on any specific domain knowledge of the dataset.
Context Modeling • Vector Space Model (VSM). • Defining context modeling. • Generality of context modeling. • Context unit selection. • Strength weighting for context units.
Vector Space Model (VSM) • Used in natural language processing. • Use of ‘term vectors’ and ‘weight vectors’. • Why use Vector Space Model?
Context Model definition • Given a dataset D and a selected set of context units U = {u1, u2, …, uk}, we represent the context c(pα) of a frequent pattern pα as a vector <w1, w2, …, wk>, where wi = w(ui, pα) and w(·, pα) is the weighting function. Likewise, a transaction t is represented as a vector <v1, v2, …, vk>, where vi = 1 iff ui appears in t, otherwise vi = 0.
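A minimal sketch of this representation; the context units and the placeholder co-occurrence weighting are illustrative assumptions (the actual strength measure is discussed later):

```python
# Illustrative sketch of the vector-space context model.
D = [{"a", "b", "c"}, {"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}]
p_alpha = {"a", "c"}
U = [frozenset({"a"}), frozenset({"b"}), frozenset({"d"})]  # chosen context units

def transaction_vector(t, U):
    """Binary vector <v1, ..., vk>: vi = 1 iff unit ui appears in transaction t."""
    return [1 if u <= t else 0 for u in U]

def context_vector(p_alpha, U, w):
    """Context model c(p_alpha) = <w1, ..., wk> with wi = w(ui, p_alpha)."""
    return [w(u, p_alpha) for u in U]

# Placeholder weighting (an assumption): raw co-occurrence count in D.
def cooccurrence_weight(u, p_alpha):
    return sum(1 for t in D if u <= t and p_alpha <= t)

print(context_vector(p_alpha, U, cooccurrence_weight))
```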
Generality of context modeling • The context model can capture existing pattern summarization approaches, e.g. "Summarization of Itemset Patterns Using Probabilistic Models" by Chao Wang & Srinivasan Parthasarathy.
Context Unit Selection • Definition: A context unit is a minimal unit which holds semantic information in a dataset. • The choice of a particular unit is task dependent.
Granularity & Redundancy Removal • Definition: Granularity defines the level of detail within a particular dataset. • Varied Granularity <=> Redundancy. • Redundancy removal techniques: • Existing techniques. • Closed Frequent Pattern removal. • Micro-clustering.
Existing Techniques • Techniques such as dimension reduction and pattern summarization: • 1) Latent semantic analysis, introduced by S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391–407, 1990. However, such techniques are not suitable here: they were designed for dimension reduction of high-dimensional datasets, so the dimensions are reduced but the redundancy remains. • 2) Pattern profiles, introduced by X. Yan, H. Cheng, J. Han, and D. Xin. Summarizing itemset patterns: a profile-based approach. Too specialized to be of generalized use in our case.
Closed Frequent Pattern • Looking at closed and maximal frequent patterns. • Drawbacks of maximal frequent patterns. • Why closed frequent patterns. • How?
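To make the closed/maximal distinction concrete, here is a hedged sketch of both checks; `items` is the full item alphabet, and checking one-item extensions suffices because any equal-support (or frequent) proper superset implies an equal-support (or frequent) one-item extension:

```python
def support(pattern, D):
    """Number of transactions in D that contain the pattern."""
    return sum(1 for t in D if pattern <= t)

def is_closed(pattern, items, D):
    """Closed: no proper superset has the same support."""
    s = support(pattern, D)
    return all(support(pattern | {x}, D) < s for x in items - pattern)

def is_maximal(pattern, items, D, min_sup):
    """Maximal: no proper superset is frequent."""
    return all(support(pattern | {x}, D) < min_sup for x in items - pattern)
```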
Micro-clustering • Need for micro-clustering. • Jaccard distance. • Types of micro-clustering: • Hierarchical micro-clustering. • One-pass micro-clustering.
Micro-clustering (cont'd) • Both algorithms output a set of representative frequent itemsets. • Both further ensure that the distance between any two retained patterns is above a certain threshold.
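A hedged sketch of the Jaccard distance between two patterns (measured on their sets of supporting transactions) and a naive one-pass grouping; here `supporting` maps each pattern to the ids of its supporting transactions, and the threshold value and cluster-representative choice are assumptions, not the paper's exact algorithm:

```python
def jaccard_distance(Da, Db):
    """d(p1, p2) = 1 - |D1 & D2| / |D1 | D2| over supporting transaction sets."""
    Da, Db = set(Da), set(Db)
    union = Da | Db
    return 1.0 if not union else 1.0 - len(Da & Db) / len(union)

def one_pass_microcluster(patterns, supporting, threshold=0.2):
    """Assign each pattern to the first cluster whose representative (its first
    pattern) is within `threshold`; otherwise open a new cluster. A sketch only."""
    clusters = []
    for p in patterns:
        for cluster in clusters:
            if jaccard_distance(supporting[p], supporting[cluster[0]]) <= threshold:
                cluster.append(p)
                break
        else:
            clusters.append([p])
    return clusters
```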
Strength Weighting for Context Units • Weighting functions and the constraints a good one should satisfy. • For a context indicator ui ∈ U and a frequent pattern pα, a strength weighting function w(·, pα) is good if: • w(ui, pα) <= w(pα, pα): the best semantic indicator of pα is pα itself. • w(ui, pα) = w(pα, ui): two patterns are equally strong at indicating the meaning of each other. • w(ui, pα) = 0 if the appearance of ui and pα is independent: in that case ui cannot indicate the semantics of pα.
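One natural instantiation consistent with all three constraints is mutual information between the binary occurrence variables of ui and pα: it is symmetric, zero under independence, and maximized by the pattern itself. A hedged sketch, not necessarily the paper's exact measure:

```python
import math

def mutual_information_weight(u, p_alpha, D):
    """w(u, p_alpha) = I(Xu; Xp), where Xu, Xp indicate whether u / p_alpha
    occur in a transaction; probabilities estimated from counts over D."""
    n = len(D)
    mi = 0.0
    for xu in (False, True):
        for xp in (False, True):
            joint = sum(1 for t in D if (u <= t) == xu and (p_alpha <= t) == xp)
            cu = sum(1 for t in D if (u <= t) == xu)
            cp = sum(1 for t in D if (p_alpha <= t) == xp)
            if joint and cu and cp:
                # p(xu, xp) * log( p(xu, xp) / (p(xu) * p(xp)) )
                mi += (joint / n) * math.log(joint * n / (cu * cp))
    return mi
```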
Semantic Analysis and Pattern Annotation • Semantic similarity: • Earlier we introduced the notion that frequent patterns are semantically similar if their contexts are similar to each other. • Formally: let pα, pβ, pγ be three frequent patterns in P and c(α), c(β), c(γ) be their context models. Let sim(c(·), c(·)) : Vk × Vk -> R+ be a similarity function over two context vectors. If sim(c(α), c(β)) > sim(c(α), c(γ)), we say that pα is semantically more similar to pβ than to pγ w.r.t. sim(c(·), c(·)). • The cosine function is widely used to measure the similarity between two vectors: sim(c(α), c(β)) = Σi ai·bi / (sqrt(Σi ai²) · sqrt(Σi bi²)), where c(α) = <a1, a2, …, an> and c(β) = <b1, b2, …, bn>.
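A direct sketch of this cosine similarity:

```python
import math

def cosine(ca, cb):
    """sim(c(alpha), c(beta)) = sum(ai * bi) / (||c(alpha)|| * ||c(beta)||)."""
    dot = sum(a * b for a, b in zip(ca, cb))
    norm = math.sqrt(sum(a * a for a in ca)) * math.sqrt(sum(b * b for b in cb))
    return 0.0 if norm == 0.0 else dot / norm
```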
Extracting the Strongest Context Indicators • Let pα be a frequent pattern and c(pα) be its context model, defined as a context vector <w1, w2, …, wk> over a set of context units U = {u1, u2, …, uk}. As defined earlier, wi is a weight for ui which states how well ui indicates the semantics of pα. • Therefore the goal of extracting the strongest context indicators is to extract a subset U' ⊆ U of k' context units such that for any ui ∈ U' and uj ∉ U' we have wi >= wj, i.e., the k' units with the largest weights.
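Operationally this is just a top-k' selection over the weight vector; a minimal sketch:

```python
def strongest_indicators(units, weights, k_prime):
    """Return the k' context units with the largest weights (ties broken arbitrarily)."""
    ranked = sorted(zip(units, weights), key=lambda uw: uw[1], reverse=True)
    return [u for u, _ in ranked[:k_prime]]
```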
Extracting Representative Transactions • Let pα be a frequent pattern, c(pα) be its context model and D = {t1, …, tl} be a set of transactions. Our goal is to select kt transactions using a similarity function: • Represent each transaction as a vector over the context units. • Use the cosine function to measure its similarity to c(pα). • Compute the similarity of each transaction, rank them in descending order, and select the top kt transactions.
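Putting the pieces together, a sketch that reuses the `transaction_vector` and `cosine` helpers from the earlier sketches (so it is not self-contained on its own):

```python
def representative_transactions(D, context, U, k_t):
    """Rank transactions by cosine similarity between their binary vectors and
    the pattern's context model c(p_alpha), then keep the top k_t."""
    scored = [(cosine(transaction_vector(t, U), context), t) for t in D]
    scored.sort(key=lambda st: st[0], reverse=True)
    return [t for _, t in scored[:k_t]]
```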
Experiments and Results • To test the proposed framework, we apply the proposed methods and algorithms to three datasets of completely different backgrounds: • The DBLP dataset. • Matching of protein motifs and GO terms. • Matching gene synonyms.
DBLP Dataset • A subset of the DBLP dataset is considered. • It contains papers from the proceedings of 12 major conferences in Databases and Data Mining. • The data is stored in transactions containing 2 parts: • The author's name. • The title of the corresponding paper. • Two kinds of patterns are considered: • Frequent co-authorships. • Frequent title terms. • The goal of this experiment is to demonstrate the effectiveness of SPA in generating a dictionary-like annotation for the frequent patterns.
Experiments • On the basis of authors/co-authors. • On the basis of the titles of the papers presented. • For both experiments, closed frequent itemset mining and closed sequential pattern mining methods were used to generate a set of closed frequent itemsets of authors/co-authors and a set of closed sequential patterns of title terms. • A technique called Krovetz stemming is used to convert the title terms into their root forms.
Details • We set the minimum support as 10 for frequent itemsets and 4 for sequential patterns. • This outputs 9926 closed patterns; we then use the one-pass micro-clustering algorithm to reduce them to a smaller set of 3443.
Matching motifs and GO terms • Prediction of the functionality of newly discovered protein motifs. • Gene Ontology (GO). • The goal is to match each individual motif to the GO terms which best represent its functions. • Hence in this case the problem may be formulated as: given a set of transactions D (protein sequences with motifs), a set P of frequent patterns in D to be annotated (motifs), and a set of candidate patterns PC with explicit semantics (GO terms), our goal is, for each pα ∈ P, to find the P'C ⊆ PC which best indicates the semantics of pα.
Details • We use the same dataset and judgments as in T. Tao, C. Zhai, X. Lu, and H. Fang. A study of statistical methods for function prediction of protein motifs. • There are 12181 sequences, 1097 motifs and 3761 GO terms. • We use the same performance measure as in the above paper (a variant of MRR, Mean Reciprocal Rank) to evaluate the effectiveness of the SPA technique on the motif-GO matching problem. • Formally, let G = {g1, g2, …, gn} be the set of GO terms; given a motif pattern pα, GO' = {g1', g2', …, gm'} ⊆ G is the set of correct GO terms for that pattern. We rank G with the SPA system and pick the top ranked terms, where a GO term is treated either as a context unit or as a semantically similar pattern to pα.
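For reference, a sketch of standard Mean Reciprocal Rank; the paper evaluates with a variant of this, so take it as the textbook form rather than the exact measure used:

```python
def mean_reciprocal_rank(rankings, gold_sets):
    """rankings: one ranked candidate list per query;
    gold_sets: the set of correct answers for each query.
    MRR = average over queries of 1/rank of the first correct answer (0 if none)."""
    total = 0.0
    for ranked, gold in zip(rankings, gold_sets):
        for rank, candidate in enumerate(ranked, start=1):
            if candidate in gold:
                total += 1.0 / rank
                break
    return total / len(rankings)
```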
Matching Gene Synonyms • In biomedical literature it is common to refer to the same gene by different names, called gene synonyms. • These synonyms do not appear with each other but are replaceable with one another. In this experiment we use the SPA technique to extract SSPs (Semantically Similar Patterns). • We construct a list of 100 synonyms, randomly selected from the BioCreAtIvE Task 1B data, which is basically a collection of abstracts of different papers. We extract all sentences which contain at least one synonym from the list; keeping the support above 3, we get a list of 41 synonyms. • We then mix the synonyms which belong to different genes and use the algorithm to extract the matching synonyms. • As in the previous case, we use MRR to measure the effectiveness of the algorithm.
Related Work • To our knowledge, the problem of semantic pattern annotation has not been well studied. • Most frequent pattern mining work focuses on discovering frequent patterns and does not address the problem of post-processing them. • The works proposed to shrink the size of the discovered pattern set are not effective at removing redundancy among the patterns. • None of these works provide information beyond basic statistical information. • Recent research develops techniques to approximate and summarize a set of frequent patterns. Although these explore some context information, none of them provide in-depth semantic information. • Context analysis is quite common in natural language processing, but it focuses on non-redundant word-based contexts, which differ from pattern contexts. • Although not optimal, the general methods proposed here can be well applied to these tasks.
Conclusions • Existing mining works generate a large set of frequent patterns without providing information to interpret them. • We propose the novel problem of semantic pattern annotation (SPA): generating semantic annotations for frequent patterns. • We propose algorithms that exploit context modeling and semantic analysis to generate semantic annotations automatically. The proposed methods are quite general and can deal with any type of frequent pattern that has context information. • We evaluated our approach on 3 different datasets; the results show that our methods can generate semantic annotations effectively. • As shown, the proposed methods can be applied to many interesting real-world tasks by selecting different context units. • A major goal for future research is to fully develop the potential of the proposed framework by studying alternative instantiations.