This paper explores the generation of semantic annotations for frequent patterns within data mining. It emphasizes the importance of contextual analysis in understanding patterns' meanings and usability. The authors address challenges such as effectively representing the semantics of frequent patterns and inferring their meanings in a generalized manner. By developing a semantic annotations database, the study highlights potential applications and methods for extracting context indicators and representative transactions, ultimately leading to more meaningful pattern representations.
Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University of Illinois at Urbana-Champaign June 6, 2014
Frequent Pattern Mining ([Agrawal & Srikant 94] and many others) • Itemsets: {diaper, milk}; {camera, film}; … • Sequential Patterns: … Mining Closed Frequent Graph Patterns … Mining Graph and Structured Patterns in ... … • Subgraph Patterns: [figure] • Database → Frequent Patterns: A, B, C, D, E, F, AB, EF, AE, CD, CE, DE, AF, BE, BF, CDE, ABE, ABF
Toward Understanding the Patterns: Find Canonical Patterns • Database → Frequent Patterns: A, B, C, D, E, F, AB, EF, AE, CD, CE, DE, AF, BE, BF, CDE, ABE, ABF • (Yan et al. ’05; Xin et al. ’05)
Toward Understanding the Patterns: How to Interpret Patterns? • Do they all make sense? • What do they mean? • How are they useful? • Example patterns: “diaper, beer”; “female sterile (2) tekele” • Morphological info. and simple statistics vs. semantic information • Not all frequent patterns are useful, only those with meanings… • Our goal: annotate patterns with semantic information
Challenges • How can we represent the semantics of a frequent pattern? (Annotate a pattern with what?) • How can we infer pattern semantics? (How to annotate?) • How can we do it in a general way? (Do it for all kinds of patterns) • Once such annotations are generated, what can we use them for? (Applications)
A Dictionary Analogy • Word: “pattern” (from Merriam-Webster) • Non-semantic info. • Definitions indicating semantics • Examples of usage • Synonyms • Related words
What about a “Pattern Dictionary”? Semantic Pattern Annotation (SPA) • Dictionary entry. Word: pattern • Non-semantic: function; pronunciation; date; etc. • Definitions: A form or model proposed for … • Examples: a dressmaker’s pattern; a pattern of dissent • Synonyms: design, device, motif, motive… • Related words: original, constellation … • SPA entry. Pattern: “latent semantic analysis” • Non-semantic: sequential; closed; sup = 0.1% • Context Indicators (CI): “indexing”, “semantic”, “S. Dumais”, “singular value decomposition”, … • Representative Transactions: “index by latent semantic analysis”; “probabilistic latent semantic analysis” • Semantically Similar Patterns (SSP): “latent semantic indexing”, “LSA”, “PLSA”
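One way to picture an SPA entry is as a small record with the four parts listed on the slide. The sketch below is illustrative only: the class name `PatternAnnotation` and all field names are assumptions, not taken from the paper, and the values are the ones shown for “latent semantic analysis”.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PatternAnnotation:
    """One entry of a hypothetical 'pattern dictionary' (names are illustrative)."""
    pattern: str
    pattern_type: str          # non-semantic info, e.g. "sequential"
    closed: bool               # non-semantic info: is the pattern closed?
    support: float             # relative support; 0.001 means 0.1%
    context_indicators: List[str] = field(default_factory=list)
    representative_transactions: List[str] = field(default_factory=list)
    similar_patterns: List[str] = field(default_factory=list)


# The example entry from the slide, stored in the record.
entry = PatternAnnotation(
    pattern="latent semantic analysis",
    pattern_type="sequential",
    closed=True,
    support=0.001,
    context_indicators=["indexing", "semantic", "S. Dumais",
                        "singular value decomposition"],
    representative_transactions=["index by latent semantic analysis",
                                 "probabilistic latent semantic analysis"],
    similar_patterns=["latent semantic indexing", "LSA", "PLSA"],
)
```

Keeping the non-semantic fields separate from the three semantic lists mirrors the dictionary analogy: the former describe the pattern's form, the latter its meaning.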
How Can We Generate Such an Entry? • Frequent Patterns (P1: AB, P2: CD, P3: …, Pn) → ? → Semantic Annotations Database • How to infer the semantics of a frequent pattern?
Continue the Analogy… • “You shall know a word by the company it keeps.” (Firth 1957) • Word contexts: “Data … association … pattern … MINE … algorithm …” vs. “mountain … Africa … diamond … MINE … weight …” • Pattern contexts: {A,B}: { … Baby, Milk, Diaper, Toy, Soymilk … }; {C,D}: { … Printer, Film, Camera, Lens, … } • You’ll know the meaning of a pattern by its context
Our Approach: Model the Context • Context units = objects co-occurring with p • Database → Frequent Patterns (P1: AB, P2: CD, …, Pn) → Context Units → Semantic Annotations • Context of AB: <E, F, …, EF, …, ABE> • Context of CD: <E, F, …, EF, …, CDEF>
Semantic Analysis with Context Models • Task 1: Model the context of a frequent pattern • Based on the context model: • Task 2: Extract the strongest context indicators • Task 3: Extract representative transactions • Task 4: Extract semantically similar patterns
Task 1: Context Modeling, a Vector Space Model • Database → Frequent Patterns → Context Units → Semantic Annotations • P1: AB, context units <E, F, …, EF, …, ABE>, vector <2.0, 2.0, …, 1.0, …, 1.0> • P2: CD, context units <E, F, …, EF, …, CDEF>, vector <2.0, 2.0, …, 1.0, …, 1.0> • … Pn • Context unit weight: co-occurrence, mutual information, … • Context similarity: cosine similarity, Pearson coefficient, …
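The vector space model can be sketched for the simplest case, itemset patterns with raw co-occurrence weights and cosine similarity. This is a minimal sketch under assumptions: the function names are hypothetical, the third and fourth transactions are made up, and the paper's other weighting options (such as mutual information) are not shown.

```python
import math
from typing import FrozenSet, List

Itemset = FrozenSet[str]


def context_vector(pattern: Itemset, units: List[Itemset],
                   transactions: List[Itemset]) -> List[float]:
    """Weight each context unit by the number of transactions
    that contain both the unit and the pattern (co-occurrence)."""
    return [float(sum(1 for t in transactions if pattern <= t and unit <= t))
            for unit in units]


def cosine(x: List[float], y: List[float]) -> float:
    """Cosine similarity of two context vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


transactions = [frozenset(t) for t in (
    {"diaper", "milk", "babywear", "lotion"},
    {"camera", "memory stick", "printer"},
    {"diaper", "milk", "toy"},        # made-up transaction
    {"camera", "film", "printer"},    # made-up transaction
)]
# Single items used as context units, for brevity.
units = [frozenset({u}) for u in ("diaper", "milk", "printer", "camera")]

baby = context_vector(frozenset({"diaper", "milk"}), units, transactions)
photo = context_vector(frozenset({"camera", "printer"}), units, transactions)
milk = context_vector(frozenset({"milk"}), units, transactions)
```

Patterns that never share a transaction end up with orthogonal vectors, while `{milk}` and `{diaper, milk}` share the same context in this toy data.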
Context Unit Selection • Transactions: t1: diaper, milk, babywear, lotion; t2: camera, memory stick, printer • Valid context units: single items (diaper, milk, printer, …); itemsets ({milk, lotion}, {camera, …}, …); transactions (t1, t2) • In general, context units are frequent patterns
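Since context units are in general frequent patterns, candidate units for itemset data can be enumerated with any frequent itemset miner. The sketch below uses brute-force counting, which is only practical for tiny examples and is not the mining algorithm used by the authors; `frequent_itemsets`, `min_support`, `max_size`, and the third transaction are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Set


def frequent_itemsets(transactions: Iterable[Set[str]], min_support: int,
                      max_size: int = 2) -> Dict[FrozenSet[str], int]:
    """Count every itemset of up to max_size items and keep those
    appearing in at least min_support transactions."""
    counts: Counter = Counter()
    for t in transactions:
        items = sorted(t)
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


transactions = [
    {"diaper", "milk", "babywear", "lotion"},   # t1 from the slide
    {"camera", "memory stick", "printer"},      # t2 from the slide
    {"diaper", "milk", "lotion"},               # made-up transaction
]
candidate_units = frequent_itemsets(transactions, min_support=2)
```

With `min_support=2`, items such as `camera` that occur only once are dropped, and both single items and pairs like `{diaper, milk}` survive as candidate context units.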
Context Unit Selection: Redundancy Removal • Problem: too many valid context units, and most are redundant • Example: for { diaper, milk, babywear }: “diaper”, “diaper, milk”, “milk, babywear”, “milk, lotion”, … • Solution: • use closed patterns • micro-clustering (hierarchical or one-pass) • distance measure: Jaccard distance (γ: threshold to stop clustering)
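The slide does not show the distance formula, so the sketch below assumes the usual Jaccard distance over the sets of transactions that support each unit, 1 − |T(a) ∩ T(b)| / |T(a) ∪ T(b)|. The one-pass scheme shown (each unit joins the first cluster whose representative lies within γ) is likewise an assumption and may differ from the authors' procedure; all names and transaction IDs are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def jaccard_distance(a: Set[int], b: Set[int]) -> float:
    """Jaccard distance between two sets of supporting transaction IDs."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def one_pass_clusters(support: Dict[str, Set[int]],
                      gamma: float) -> List[Tuple[str, List[str]]]:
    """Scan the units once; a unit joins the first cluster whose
    representative is within gamma, otherwise it starts a new cluster."""
    clusters: List[Tuple[str, List[str]]] = []
    for unit, tids in support.items():
        for representative, members in clusters:
            if jaccard_distance(support[representative], tids) <= gamma:
                members.append(unit)
                break
        else:
            clusters.append((unit, [unit]))
    return clusters


# Made-up supporting transaction IDs for four context units.
support = {
    "diaper": {1, 2, 3},
    "diaper, milk": {1, 2, 3},
    "milk, babywear": {1, 2},
    "printer": {4, 5},
}
clusters = one_pass_clusters(support, gamma=0.4)
representatives = [rep for rep, _ in clusters]
```

Only the cluster representatives would then be kept as context units, which shrinks the vector dimension while preserving distinct contexts.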
Task 2: Extract Context Indicators • Database → Frequent Patterns → Context Units → Semantic Annotations • Context units: <A, B, AB, C, D, CD, E, F, EF, AE, BF, …, ABE, ABF, …, ABEF>, reduced to <AB, CD, …, EF, …, ABE, …> • Context unit weighting for P1: AB: <3.0, 0, …, 2.0, …, 1.0, …> • Context indicators of AB: AB 3.0, EF 2.0, ABE 1.0, …
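Once a pattern has a weighted context vector, picking context indicators amounts to sorting units by weight. A minimal sketch, assuming a hypothetical helper `top_context_indicators` and using only the four weights visible on the slide:

```python
from typing import List, Tuple


def top_context_indicators(units: List[str], weights: List[float],
                           k: int) -> List[Tuple[str, float]]:
    """Return the k context units with the largest positive weights."""
    ranked = sorted(zip(units, weights), key=lambda uw: uw[1], reverse=True)
    return [(unit, weight) for unit, weight in ranked[:k] if weight > 0]


units = ["AB", "CD", "EF", "ABE"]
weights = [3.0, 0.0, 2.0, 1.0]      # context vector of pattern AB
indicators = top_context_indicators(units, weights, k=3)
```

Units with zero weight are never reported, even when fewer than `k` units have positive weight.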
Task 3: Extract Representative Transactions • Database → Frequent Patterns → Context Units → Semantic Annotations • Context units: <AB, CD, …, EF, …, ABE, …> • P1: AB: <3.0, 0, …, 2.0, …, 1.0> • T1: <1.0, 0, …, 1.0, …, 1.0>; T5: … • Semantic similarity to AB: T5 0.8, T1 0.6, T3 0.6, …
Task 4: Extract Semantically Similar Patterns • Database → Frequent Patterns → Context Units → Semantic Annotations • Context units: <AB, CD, …, EF, …, ABE, …> • P1: AB: <3.0, 0, …, 2.0, …, 1.0>; P2: CD: <0, 3.0, …, 2.0, …, 0.5>; Pk: EF: … • Semantic similarity to AB: CD 0.7, BF 0.5, EF 0.3, …
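Tasks 3 and 4 share one mechanism: place transactions or other patterns in the same context-unit space as the target pattern and rank them by similarity. The sketch below assumes cosine similarity and a hypothetical helper `rank_by_similarity`. The vectors keep only the four components visible on the slides, and the vectors for T2, T3, and BF are made up, so the scores do not reproduce the numbers shown above.

```python
import math
from typing import Dict, List, Tuple


def cosine(x: List[float], y: List[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def rank_by_similarity(target: List[float],
                       candidates: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Rank candidate vectors by cosine similarity to the target, best first."""
    scored = [(name, cosine(target, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Context units, in order: AB, CD, EF, ABE.
ab = [3.0, 0.0, 2.0, 1.0]

# Task 3: transactions as vectors (1.0 where the unit occurs in the transaction).
transactions = {"T1": [1.0, 0.0, 1.0, 1.0],
                "T2": [0.0, 1.0, 0.0, 0.0],   # made up
                "T3": [1.0, 0.0, 0.0, 0.0]}   # made up
representative = rank_by_similarity(ab, transactions)

# Task 4: other patterns as context vectors.
patterns = {"CD": [0.0, 3.0, 2.0, 0.5],
            "BF": [0.0, 2.0, 1.0, 0.0]}       # made up
similar = rank_by_similarity(ab, patterns)
```

Because both tasks reduce to the same ranking call, swapping in another similarity measure (for example the Pearson coefficient) would only require replacing `cosine`.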
Experiments • Three different real-world applications • Annotating DBLP title/author patterns • Motif/Gene Ontology (GO) matching • Gene synonym extraction • Study the effectiveness of the proposed SPA methods • Explore applications of SPA to different real-world tasks
Annotating DBLP Co-authorship and Title Patterns • Database (Authors | Title): X. Yan, P. Yu, J. Han | Substructure Similarity Search in Graph Databases; … • Frequent patterns: P1: { x_yan, j_han } (frequent itemset); P2: “substructure search” (frequent sequential pattern) • Context units: < { p_yu, j_han }, { d_xin }, …, “graph pattern”, …, “substructure similarity”, … > • Output: semantic annotations
DBLP Results: Frequent Itemset Pattern= {xifeng_yan, jiawei_han} Annotations:
DBLP Results: Freq. Seq. Pattern Pattern= “Information … retrieval” Annotations:
Motif-GO Matching • Motif: a subsequence pattern in the sequences • Gene Ontology (GO) terms: annotate the functionality of sequences and motifs • Question: which GO terms match motif2? • [Figure: Sequences 1 to 3, each containing some of motif1 to motif5 and annotated with some of GO terms 1 to 5]
Motif-GO Matching (Cont.) • Database (Protein Sequence | GO terms): [sequence containing Motif 1] | GOTerm1; GOTerm2; GOTerm3; … | GOTerm3; … • Frequent patterns: P1: Motif1 (sequential pattern); P2: GOTerm2 (single-item pattern) • Context units: <Motif1, Motif3, …, GOTerm1, GOTerm2, …> • Semantic annotations → Motif-GO matching
Motif/GO Matching: Evaluation • Gold standard generated by human experts • Measure: mean reciprocal rank (MRR) • Reflects ranking accuracy (the higher the better) • Reciprocal rank = 1/rank of the correct answer (0.5 means the correct answer is ranked 2nd) • Results: [table comparing weights for context units and ranking strategies]
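MRR can be computed as below. This is a generic sketch of the standard measure, not the authors' evaluation script; the motif and GO-term identifiers are placeholders, and a query whose correct answer never appears is assumed to contribute 0.

```python
from typing import Dict, List, Set


def mean_reciprocal_rank(rankings: Dict[str, List[str]],
                         gold: Dict[str, Set[str]]) -> float:
    """Average, over queries, of 1/rank of the first correct answer."""
    if not rankings:
        return 0.0
    total = 0.0
    for query, ranked in rankings.items():
        correct = gold.get(query, set())
        for rank, candidate in enumerate(ranked, start=1):
            if candidate in correct:
                total += 1.0 / rank
                break
    return total / len(rankings)


# Placeholder rankings: GO terms ranked for each motif.
rankings = {"motif1": ["GOTerm1", "GOTerm2"],
            "motif2": ["GOTerm3", "GOTerm1"]}
gold = {"motif1": {"GOTerm2"}, "motif2": {"GOTerm3"}}
mrr = mean_reciprocal_rank(rankings, gold)
```

Here motif1 finds its correct term at rank 2 (reciprocal rank 0.5) and motif2 at rank 1 (1.0), so the mean is 0.75.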
Gene Synonym Extraction • Gene Synonyms: • A Sequential Pattern in the textual database • Matching gene synonyms: a challenging and important new problem in mining biology data • Analogy: thesaurus or synonyms in dictionary
Gene Synonym Extraction (Cont.) • Database: biomedical sentences, e.g. “… D. melanogaster gene Female sterile (2) Tekele …”; “… Female sterile (2) Tekele, abbreviated as Fs(2)Tek …” • Frequent patterns: P1: female sterile (2) tekele (sequential pattern); P2: Fs(2)Tek (sequential pattern) • Context units: <gene, female, …, d. melanogaster gene, …>; context units can be single words or sequential patterns • Semantic annotations → matched synonyms
Gene Synonym Extraction: Results • [Figures: MRR and running time, each for hierarchical and one-pass micro-clustering] • Effective! MRR > 0.5 • Frequent patterns >> single words as context units • Micro-clustering is useful
Conclusions • A novel problem: semantic pattern annotation • A structured annotation for frequent patterns • A general method based on context modeling • A general post-processing procedure for frequent pattern mining, applicable to any type of pattern • Applicable to and effective for quite different tasks • Future work: • Tune for specific tasks • Better context unit weights, redundancy removal, etc.