Mining Complex Data



Presentation Transcript


  1. Mining Complex Data COMP 790-90 Seminar Spring 2011

  2. Mining Complex Patterns
     • Common Pattern Mining Tasks:
        • Itemsets (transactional, unordered data)
        • Sequences (temporal/positional: text, bioseqs)
        • Tree patterns (semi-structured/XML data, web mining)
        • Graph patterns (protein structure, web data, social network)

  3. Example Pattern Types
     [Figure: the items A, B, C, D shown as an itemset, a sequence, a tree, and a graph]
     • Can add attributes
        • To nodes
        • To edges
     • Attributes
        • Labels
        • Type (directed or undirected)
        • Set-valued

  4. Induced vs Embedded Sub-trees
     • Induced Sub-trees: S = (Vs, Es) is a sub-tree of T = (V, E) if and only if
        • Vs ⊆ V
        • e = (nx, ny) ∊ Es iff (nx, ny) ∊ E (nx is directly connected to ny)
     • Embedded Sub-trees: S = (Vs, Es) is a sub-tree of T = (V, E) if and only if
        • Vs ⊆ V
        • e = (nx, ny) ∊ Es iff nx ≤l ny in T (nx is connected to ny by a path, so nx is an ancestor of ny but not necessarily its parent)
     • An induced sub-tree is a special case of an embedded sub-tree.
     • We say S occurs in T, and T contains S, if S is an embedded sub-tree of T.
     • If S has k nodes, we call it a k-sub-tree.

  5. Mining Frequent Trees
     • Support: the support of a subtree in a database of trees is the number of trees that contain the subtree.
     • A subtree is frequent if its support is at least the minimum support.
     • TreeMiner: given a database of trees (a forest) and a minimum support, find all frequent subtrees.
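The support definition above can be sketched in a few lines of Python. This is a brute-force illustration, not the TreeMiner algorithm: it assumes trees are ordered and written in the string encoding used later in the deck (labels in preorder, -1 for each backtrack), and it treats "contains" as ordered embedded containment. The helper names `parse`, `contains` and `support` are hypothetical.

```python
def parse(encoding):
    """Return (labels, upper): labels in preorder, and for each node the
    preorder position of its last descendant (the node itself for a leaf)."""
    labels, upper, open_nodes = [], [], []
    for tok in encoding.split():
        if tok == "-1":
            upper[open_nodes.pop()] = len(labels) - 1
        else:
            open_nodes.append(len(labels))
            labels.append(tok)
            upper.append(None)
    for node in open_nodes:  # nodes never closed by an explicit -1
        upper[node] = len(labels) - 1
    return labels, upper


def contains(tree, pattern):
    """True if `pattern` is an ordered embedded subtree of `tree`."""
    t_labels, t_upper = parse(tree)
    p_labels, p_upper = parse(pattern)

    def consistent(j, t, image):
        # image[a] is the tree node already chosen for pattern node a < j.
        for a, ta in enumerate(image):
            if j <= p_upper[a]:
                # a is an ancestor of j: t must be a descendant of image[a].
                if not (ta < t <= t_upper[ta]):
                    return False
            elif t <= t_upper[ta]:
                # a lies to the left of j: t must fall after image[a]'s subtree.
                return False
        return True

    def place(image):
        j = len(image)
        if j == len(p_labels):
            return True
        start = image[-1] + 1 if image else 0
        for t in range(start, len(t_labels)):
            if t_labels[t] == p_labels[j] and consistent(j, t, image):
                if place(image + [t]):
                    return True
        return False

    return place([])


def support(database, pattern):
    """Number of trees in the database that contain the pattern."""
    return sum(contains(tree, pattern) for tree in database)


database = ["0 1 3 1 -1 2 -1 -1 2 -1 -1 2 -1", "1 2 -1 2", "0 1 -1 2"]
print(support(database, "1 2"))
```

Each tree counts at most once toward the support, however many times the pattern occurs inside it. The backtracking search is exponential in the worst case, which is acceptable only for small illustrative inputs.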

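  6. String Representation of Trees
     [Figure: example tree with nodes n0 to n6. n0 (label 0) has children n1 (label 1) and n6 (label 2); n1 has children n2 (label 3) and n5 (label 2); n2 has children n3 (label 1) and n4 (label 2)]
     String representation: 0 1 3 1 -1 2 -1 -1 2 -1 -1 2 -1
     With N nodes, M branches, F max fanout:
     • Adjacency matrix requires N(F+1) space
     • Adjacency list requires 4N-2 space
     • Tree (node, child, sibling) requires 3N space
     • String representation requires 2N-1 space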

  7. Tree: String Representation
     • Like an itemset
     • -1 as the backtrack item
     • Assuming only labels on nodes
     • For trees with labels on edges, an edge label can be folded into the label of the node below it: edge-label + node-label = new label
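A minimal Python sketch of this encoding, assuming node-labelled ordered trees: a preorder walk emits each label on entry and a -1 each time it backtracks to the parent. The `Node` class and the `encode` and `decode` helpers are hypothetical names introduced for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def encode(root):
    """Preorder walk: a label on entering a node, -1 on leaving a child."""
    tokens = [root.label]
    for child in root.children:
        tokens.extend(encode(child))
        tokens.append("-1")  # backtrack from child to parent
    return tokens


def decode(tokens):
    """Rebuild the tree by keeping the current root-to-node path on a stack."""
    root = Node(tokens[0])
    path = [root]
    for tok in tokens[1:]:
        if tok == "-1":
            path.pop()
        else:
            node = Node(tok)
            path[-1].children.append(node)
            path.append(node)
    return root


example = "0 1 3 1 -1 2 -1 -1 2 -1 -1 2 -1".split()
tree = decode(example)
print(" ".join(encode(tree)))
```

A tree with N nodes produces N labels and N-1 backtracks, which is where the 2N-1 space figure comes from. The root is never followed by a backtrack.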

  8. Match labels
     [Figure: a subtree next to the example tree, whose nodes carry preorder numbers and scopes: n0 [0,6], n1 [1,5], n2 [2,4], n3 [3,3], n4 [4,4], n5 [5,5], n6 [6,6]]
     Match Label: 03456
     Support: 1
     vector <id, match label, scope>
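The scopes in the figure can be derived directly from the string encoding. The sketch below assumes a scope [l, u] pairs a node's own preorder number l with the preorder number u of its last descendant, which is what the values on the slide suggest; the function name `scopes` is hypothetical.

```python
def scopes(encoding):
    """Return one (label, (lower, upper)) pair per node, in preorder."""
    labels, upper, open_nodes = [], [], []
    for tok in encoding.split():
        if tok == "-1":
            # Leaving a node: its scope ends at the last node seen so far.
            upper[open_nodes.pop()] = len(labels) - 1
        else:
            open_nodes.append(len(labels))
            labels.append(tok)
            upper.append(None)
    for node in open_nodes:  # the root (and any node without a closing -1)
        upper[node] = len(labels) - 1
    return [(labels[i], (i, upper[i])) for i in range(len(labels))]


for position, (label, scope) in enumerate(scopes("0 1 3 1 -1 2 -1 -1 2 -1 -1 2 -1")):
    print(f"n{position} label={label} scope={list(scope)}")
```

With scopes in hand, a node at position p is a descendant of a node with scope [l, u] exactly when l < p <= u, so ancestor tests need no tree traversal.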

  9. An example

  10. Generic Mining Algorithms
      • Horizontal pattern matching based
      • Vertical intersection based
      • BFS or DFS

  11. Candidate Generation & Support Counting
      • Candidate Generation
         • Extend by a node or an edge
         • Avoid duplicates as far as possible

  12. Trees: Systematic Candidate Generation
      • Two subtrees are in the same class iff they share a common prefix string P up to the (k-1)th node.
      • A valid element x can be attached only to nodes lying on the path from the root to the rightmost leaf of prefix P.
      [Figure: a prefix tree (node labels 3, 4, 2) with an element x marked at a position that is not valid]
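The rightmost-path rule can be made concrete with a small Python sketch. It assumes prefixes use the same string encoding as the trees (preorder labels, -1 for a backtrack) and that positions are 0-based preorder numbers; `rightmost_path` and `attach` are hypothetical helper names.

```python
def rightmost_path(encoding):
    """Preorder positions on the path from the root to the rightmost leaf."""
    path, rightmost, count = [], [], 0
    for tok in encoding.split():
        if tok == "-1":
            path.pop()
        else:
            path.append(count)
            count += 1
            rightmost = list(path)  # path to the most recently added node
    return rightmost


def attach(prefix, label, position):
    """Encoding of the prefix with `label` added as the last child of `position`."""
    path = rightmost_path(prefix)
    if position not in path:
        raise ValueError(f"position {position} is not on the rightmost path {path}")
    tokens = prefix.split()
    while tokens and tokens[-1] == "-1":  # drop trailing backtracks
        tokens.pop()
    # Climb from the rightmost leaf up to the attachment node.
    backtracks = len(path) - 1 - path.index(position)
    return " ".join(tokens + ["-1"] * backtracks + [label])


print(rightmost_path("1 2"))
print(attach("1 2", "3", 1))
print(attach("1 2", "4", 0))
```

Restricting attachment to the rightmost path means the new node is always the last one in preorder, so every tree is produced from exactly one prefix.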

  13. Candidate generation
      • Given an equivalence class of k-subtrees, how do we generate candidate (k+1)-subtrees?
      • Main idea: consider each ordered pair of elements in the class for extension, including self extension
      • Sort elements by node label and position

  14. Class extension

  15. Candidate Generation (Join operator)
      Equivalence Class: Prefix: 1 2, Elements: (3,1) (4,0)
      [Figure: the self join of (3,1) and the join of (3,1) with (4,0), with the resulting new candidates]
      New Equivalence Class: Prefix: 1 2 3, Elements: (3,1) (3,2) (4,0)

  16. Candidate Generation (Join operator)
      Equivalence Class: Prefix: 1 2, Elements: (3,1) (4,0)
      [Figure: the self join of (4,0) and the join of (4,0) with (3,1), with the resulting new candidates]
      New Equivalence Class: Prefix: 1 2 -1 4, Elements: (4,0) (4,2)

  17. Candidate Generation (Join operator)
      Equivalence Class: Prefix: 1 2, Elements: (3,1) (4,1)
      [Figure: the self join of (3,1) and the join of (3,1) with (4,1), with the resulting new candidates]
      New Equivalence Class: Prefix: 1 2 3, Elements: (3,1) (3,2) (4,1) (4,2)

  18. Candidate Generation (Join operator)
      Equivalence Class: Prefix: 1 2, Elements: (3,1) (4,1)
      [Figure: the join of (4,1) with (3,1) and the self join of (4,1), with the resulting new candidates]
      New Equivalence Class: Prefix: 1 2 4, Elements: (3,1) (3,2) (4,1) (4,2)
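The join examples appear to follow a simple case rule, sketched below in Python under stated assumptions. An element (x, i) means label x attached to the prefix node at preorder position i. Extending by (x, i) and joining with (y, j): when i = j, y can stay a sibling at position j or become a child of the new node x; when i > j, y stays attached at j; when i < j, no candidate results. The rule is inferred from the examples with prefix 1 2, the prefix is assumed to be non-empty, and `extend_class` is a hypothetical name.

```python
def extend_class(prefix_size, elements, chosen):
    """Elements of the new class obtained by extending the prefix with `chosen`.

    prefix_size: number of nodes in the current (non-empty) prefix
    elements:    list of (label, position) pairs in the current class
    chosen:      the (label, position) element used for the extension
    """
    _, i = chosen
    # The new node is last in preorder, so its position is the old prefix size.
    new_position = prefix_size
    new_elements = []
    for y, j in elements:
        if i == j:
            new_elements.append((y, j))             # y as a sibling of x
            new_elements.append((y, new_position))  # y as a child of x
        elif i > j:
            new_elements.append((y, j))             # y stays on an ancestor of x
        # i < j: node j is no longer on the rightmost path, so no candidate
    return new_elements


# Class with prefix "1 2" (two nodes) and elements (3,1) (4,0), extended by (3,1).
print(extend_class(2, [("3", 1), ("4", 0)], ("3", 1)))
# Class with prefix "1 2" and elements (3,1) (4,1), extended by (3,1).
print(extend_class(2, [("3", 1), ("4", 1)], ("3", 1)))
```

Every candidate returned here still needs its support counted; only the frequent ones become elements of the new class.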

  19. Apriori Style TreeMiner
