Mining Complex Types of Data

Presentation Transcript

  1. Mining Complex Types of Data 2004/10/29

  2. Outline • 1. Generalization of Structured Data • 2. Mining Spatial Databases • 3. Mining Time-Series and Sequence Data • 4. Mining Text Databases • 5. Mining the World Wide Web

  3. 1. Generalization of Structured Data • Generalization reduces the values of an attribute to a small set of categories drawn from a concept hierarchy. • This reduction often requires background knowledge. • E.g., hobby = {tennis, hockey, chess, violin, nintendo_games} generalizes to {sports, music, video_games}
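A minimal sketch of the hobby example above, assuming the concept hierarchy is stored as a plain leaf-to-parent dictionary. The mapping itself is an assumption: the slide does not say which category each hobby belongs to (chess is placed under sports here only for illustration), and the names `HOBBY_HIERARCHY` and `generalize` are hypothetical.

```python
# Hypothetical concept hierarchy (leaf value -> parent concept).
# The slide lists only the two levels, not the individual assignments.
HOBBY_HIERARCHY = {
    "tennis": "sports",
    "hockey": "sports",
    "chess": "sports",          # assumption made for this sketch
    "violin": "music",
    "nintendo_games": "video_games",
}


def generalize(values, hierarchy):
    """Replace each value by its parent concept; unknown values stay unchanged."""
    return [hierarchy.get(value, value) for value in values]


hobbies = ["tennis", "violin", "hockey", "nintendo_games", "chess"]
generalized = generalize(hobbies, HOBBY_HIERARCHY)
print(generalized)
print(sorted(set(generalized)))  # the reduced set of categories
```

Five distinct attribute values collapse to three categories, which is the reduction the slide describes.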

  4. Generalization Based Knowledge Discovery • Requires existence of background knowledge (concept hierarchies) for both spatial and non-spatial data. • Concept hierarchies are typically given by domain experts.

  5. Spatial Attribute Concept Hierarchy

  6. An Example: Plan Mining by Divide and Conquer • Plan: a variable sequence of actions • E.g., Travel (flight): <traveler, departure, arrival, d-time, a-time, airline, price, seat> • Plan mining: extraction of important or significant generalized (sequential) patterns from a planbase (a large collection of plans) • E.g., discover travel patterns in an air flight database, or • find significant patterns in the sequences of actions in the repair of automobiles • Method • Attribute-oriented induction on sequence data • A generalized travel plan: <small-big*-small> • Divide & conquer: mine characteristics for each subsequence • E.g., big*: same airline, small-big: nearby region

  7. A Travel Database for Plan Mining • Example: Mining a travel planbase • [Table: travel plans]

  8. Multidimensional Analysis • Strategy • Generalize the planbase in different directions • Look for sequential patterns in the generalized plans • Derive high-level plans • [Figure: a multi-D model for the planbase]

  9. Multidimensional Generalization • Multi-D generalization of the planbase • Merging consecutive, identical actions in plans

  10. Generalization-Based Sequence Mining • Generalize the planbase in a multidimensional way using dimension tables • Use the number of distinct values (cardinality) at each level to determine the right level of generalization (level "planning") • Use the operators merge "+" and option "[]" to further generalize patterns • Retain patterns with significant support
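The merge operator and the support filter can be sketched as follows. This is an illustration under stated assumptions, not the method from the slides: each plan is assumed to be already generalized to a list of airport sizes ("S" or "L"), a run of two or more identical symbols is written as `X+`, and the toy planbase and the function names `merge` and `pattern_support` are hypothetical.

```python
from collections import Counter
from itertools import groupby


def merge(sequence):
    """Collapse runs of identical symbols; a run of length >= 2 becomes 'X+'."""
    merged = []
    for symbol, run in groupby(sequence):
        run_length = len(list(run))
        merged.append(symbol + "+" if run_length > 1 else symbol)
    return "-".join(merged)


def pattern_support(plans, min_support):
    """Return merged patterns whose relative frequency reaches min_support."""
    counts = Counter(merge(plan) for plan in plans)
    total = len(plans)
    return {p: n / total for p, n in counts.items() if n / total >= min_support}


# Hypothetical generalized planbase: airport sizes along each trip.
plans = [
    ["S", "L", "L", "S"],
    ["S", "L", "L", "S"],
    ["L", "L", "S"],
    ["S", "S"],
]
print(pattern_support(plans, min_support=0.5))
```

With this toy data only `S-L+-S` reaches a support of 50%, so it is the single pattern retained.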

  11. Generalized Sequence Patterns • AirportSize sequences that survive the minimum support threshold (after applying the merge operator): S-L+-S [35%], L+-S [30%], S-L+ [24.5%], L+ [9%] • After applying the option operator: [S]-L+-[S] [98.5%] • Most of the time, people fly via large airports to reach their final destination • The remaining 1.5% of plans follow other patterns: S-S, L-S-L
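The 98.5% figure is the sum of the four merged patterns, since each of them is an instance of [S]-L+-[S]. A small sketch that checks this arithmetic, assuming the option operator can be modeled as a regular expression over the pattern strings (the names `SUPPORTS`, `OPTION_PATTERN`, and `option_support` are hypothetical):

```python
import re

# Supports (in percent) reported on the slide for the merged patterns.
SUPPORTS = {"S-L+-S": 35.0, "L+-S": 30.0, "S-L+": 24.5, "L+": 9.0}

# [S]-L+-[S]: optional leading S, one L+, optional trailing S.
OPTION_PATTERN = re.compile(r"^(S-)?L\+(-S)?$")


def option_support(supports, pattern):
    """Sum the support of every merged pattern covered by the option pattern."""
    return sum(s for p, s in supports.items() if pattern.match(p))


total = option_support(SUPPORTS, OPTION_PATTERN)
print(total)        # 98.5
print(100 - total)  # 1.5, the share left for patterns such as S-S and L-S-L
```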

  12. 2. Mining Spatial Databases • Introduction • Spatial Association Rules • Spatial Clustering • Spatial Classification

  13. Introduction • Spatial data • Spatial data contain geometrical information • Objects are defined by points, lines, and polygons. • Objects in the spatial database represent real-world entities (e.g., rivers) with associated attributes (e.g., flow, depth, etc.). • Objects are usually described with both spatial and nonspatial attributes. • Multidimensional trees are used to build indices for spatial data in spatial databases • E.g., quad trees, k-d trees, R-trees.

  14. Database primitives for spatial mining • Topology: A covers B; B covered-by A

  15. Database primitives for spatial mining • Distance

  16. Database primitives for spatial mining • Direction
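The distance and direction primitives can be illustrated for point objects. This is a rough sketch under assumptions the slides do not state: objects are reduced to (x, y) points, x grows to the east and y to the north, and direction is binned into eight compass sectors. The function names are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two points."""
    return math.hypot(b[0] - a[0], b[1] - a[1])


def direction(a, b):
    """Compass direction of b as seen from a (x grows east, y grows north)."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    if dx == 0 and dy == 0:
        return "same_location"
    angle = math.degrees(math.atan2(dy, dx)) % 360  # 0 = east, 90 = north
    names = ["east", "north_east", "north", "north_west",
             "west", "south_west", "south", "south_east"]
    # Each sector is 45 degrees wide and centered on its compass direction.
    return names[int((angle + 22.5) // 45) % 8]


print(distance((0, 0), (3, 4)))
print(direction((0, 0), (0, 1)))
print(direction((0, 0), (-1, -1)))
```

Real spatial databases define these primitives on lines and polygons as well; points keep the sketch short.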

  17. Spatial data mining • Discover interesting spatial patterns and features • Capture intrinsic relationships between spatial and non-spatial data • Applications • GIS • Image database exploration

  18. Spatial Association Rules • A spatial association rule is an association rule containing at least one spatial neighborhood relation • Topological relations: intersects, overlaps, disjoint, etc. • Direction relations: north, east, south_west, etc. • Distance relations: close_to, far_away, etc.

  19. Example: Spatial Associations • [Figure: answers not preserved in the transcript]

  20. • oasis → elephants in neighbourhood • wildebeests → lions in neighbourhood

  21. • lots of cheetahs → fewer zebras • no zebras → fewer cheetahs

  22. Hierarchy of spatial neighborhood relations • "g_close_to" may be specialized to near_by, touch, intersect, contain, etc. • Basic idea: if two objects do not fulfill a rough relationship (such as intersect), they cannot fulfill a refined relationship (such as meet).

  23. Using the tree to explore: • Collect task-relevant data. • Start the computation at a high level of spatial predicates, such as g_close_to. • Utilize spatial indexing methods. • For those patterns that pass the filtering at the high levels, do further refinement at the lower levels, such as adjacent_to, intersects, distance_less_than_x, etc. • Filter out those patterns that do not meet the minimum support threshold or the minimum confidence threshold. • Derive the strong association rules.
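The top-down exploration above can be sketched in a few lines. Everything concrete here is an assumption made for illustration: towns and lakes are points, both g_close_to and adjacent_to are approximated by distance thresholds, and the data, thresholds, and names (`TOWNS`, `LAKES`, `near_a_lake`, `mine`) are hypothetical. `math.dist` requires Python 3.8 or later.

```python
import math

# Hypothetical toy data: coordinates in km.
TOWNS = {"A": (0, 0), "B": (10, 0), "C": (50, 50), "D": (3, 4)}
LAKES = [(1, 1), (13, 0)]


def near_a_lake(town_xy, max_dist):
    """True if some lake lies within max_dist of the town."""
    return any(math.dist(town_xy, lake) <= max_dist for lake in LAKES)


def mine(coarse_dist, fine_dist, min_supports):
    """Evaluate the coarse predicate first; refine only the towns that pass."""
    total = len(TOWNS)
    rules = {}
    # Level 1: coarse predicate g_close_to, evaluated for every town.
    coarse = {t for t, xy in TOWNS.items() if near_a_lake(xy, coarse_dist)}
    if len(coarse) / total < min_supports[0]:
        return rules  # filtered out at the high level, no refinement needed
    rules["g_close_to"] = len(coarse) / total
    # Level 2: the finer predicate is tested only on towns that passed level 1.
    fine = {t for t in coarse if near_a_lake(TOWNS[t], fine_dist)}
    if len(fine) / total >= min_supports[1]:
        rules["adjacent_to"] = len(fine) / total
    return rules


print(mine(coarse_dist=5, fine_dist=2, min_supports=(0.5, 0.25)))
```

Three of the four towns satisfy the coarse predicate and one satisfies the fine one, so both levels survive their thresholds. Raising the first threshold to 0.9 prunes the pattern before any refinement is computed.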

  24. Example

  25. The map of British Columbia

  26. Representation of spatial objects

  27. Hierarchies for data relations

  28. Hierarchies for data relations

  29. 40 large towns in B.C. min_support=50%

  30. Level-1

  31. Level-2 min_support is reduced to 25%

  32. Level-3 min_support is reduced to 15%

  33. Two-step procedure for discovering spatial neighborhood relations • Step 1: rough spatial computation (as a filter) • Use MBRs or an R-tree for rough estimation • Step 2: detailed spatial algorithm (as refinement) • Very expensive (e.g., the intersect test) • Apply it only to those objects that have passed the rough spatial association test (support no less than min_support).
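A minimal sketch of the filter-and-refine idea, assuming every object is a single line segment and the MBR test is a plain nested loop rather than an R-tree lookup. The data and function names are hypothetical; the exact test is the standard orientation-based segment intersection check.

```python
def mbr(points):
    """Minimum bounding rectangle as (min_x, min_y, max_x, max_y)."""
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)


def mbrs_overlap(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1, -1, or 0 if collinear."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)


def segments_intersect(s, t):
    """Exact intersection test for two line segments."""
    (p1, p2), (p3, p4) = s, t
    d1, d2 = orient(p3, p4, p1), orient(p3, p4, p2)
    d3, d4 = orient(p1, p2, p3), orient(p1, p2, p4)
    if d1 != d2 and d3 != d4:
        return True
    if d1 == d2 == d3 == d4 == 0:  # all collinear: boxes decide
        return mbrs_overlap(mbr(s), mbr(t))
    return False


def intersecting_pairs(roads, rivers):
    # Step 1 (filter): cheap MBR overlap test on every pair.
    candidates = [(i, j) for i, r in enumerate(roads)
                  for j, v in enumerate(rivers)
                  if mbrs_overlap(mbr(r), mbr(v))]
    # Step 2 (refinement): exact test only on the surviving candidates.
    exact = [(i, j) for i, j in candidates
             if segments_intersect(roads[i], rivers[j])]
    return candidates, exact


roads = [((0, 0), (4, 4)), ((3, 3), (4, 4))]
rivers = [((0, 4), (4, 0)), ((10, 10), (12, 12))]
print(intersecting_pairs(roads, rivers))
```

The second road passes the MBR filter against the first river but fails the exact test, which shows why the refinement step is still needed after filtering.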

  34. Spatial Classification • A number of questions can be associated with spatial classification • Which attributes or predicates are relevant to the classification process? • How should one determine the size of the buffers that produce classes with high purity? • Can one accelerate the process of finding relevant predicates?

  35. Example: What Kind of Houses Are Highly Valued? • [Figure: map of houses marked H or L, with a highway and a lake]

  36. An efficient two-step method for classification of spatial data • Step 1: rough spatial computation (as a filter) • Using MBR or R-tree for rough estimation • Using nearest neighbor approach to find relevant predicates • Step 2: detailed computation (as refinement) • Only the relevant predicates are computed in detail for all classified objects • In the construction of the decision tree, the information gain utilized in ID3 is used
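The information gain used by ID3 is the entropy of the class labels minus the weighted entropy after splitting on an attribute. The sketch below computes it for two spatial predicates; the predicate names and the four toy houses are hypothetical and chosen so that one predicate separates the classes perfectly and the other not at all.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


# Hypothetical houses described by two spatial predicates.
rows = [
    {"close_to_lake": True, "close_to_highway": True},
    {"close_to_lake": True, "close_to_highway": False},
    {"close_to_lake": False, "close_to_highway": True},
    {"close_to_lake": False, "close_to_highway": False},
]
labels = ["H", "H", "L", "L"]  # H = high value, L = low value (assumed reading)

for attribute in ("close_to_lake", "close_to_highway"):
    print(attribute, information_gain(rows, labels, attribute))
```

In this toy data close_to_lake has the maximal gain of one bit and would be chosen as the root of the decision tree, while close_to_highway has zero gain.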

  37. [Figure; label shown three times: High_value]

  38. Spatial Clustering • Clustering in spatial data mining groups similar objects based on their distance, connectivity, or relative density in space. • In the real world, there are many physical obstacles, such as rivers, lakes, and highways, and their presence may substantially affect the result of clustering.
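To see how an obstacle can change a clustering result, the sketch below compares nearest-center assignment under plain Euclidean distance and under a distance that respects a river. The setup is a deliberately simple assumption, not taken from the slides: the river runs along the line x = 0 and can only be crossed at a single bridge point. All names and coordinates are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math

BRIDGE = (0.0, 10.0)  # hypothetical: the only crossing of a river along x = 0


def obstacle_distance(a, b):
    """Euclidean distance, with a detour via the bridge for opposite banks."""
    if a[0] * b[0] < 0:  # strictly opposite sides of the river
        return math.dist(a, BRIDGE) + math.dist(BRIDGE, b)
    return math.dist(a, b)


def assign(points, centers, dist):
    """Index of the nearest center for each point under the given distance."""
    return [min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            for p in points]


centers = [(-1.0, 0.0), (4.0, 0.0)]   # e.g. one candidate ATM site per bank
customers = [(1.0, 0.0), (3.0, 0.0)]

print(assign(customers, centers, math.dist))          # ignoring the river
print(assign(customers, centers, obstacle_distance))  # respecting the river
```

Ignoring the river, the first customer is assigned to the center on the opposite bank. With the obstacle-aware distance, both customers are assigned to the center on their own bank.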

  39. Example: Spatial Cluster • 1854 cholera epidemic, London map • [Figure labels: Disease cluster; Infected water pump?]

  40. Clustering data objects with constraints

  41. Planning the locations of ATMs • [Figure: spatial data with obstacles; labels: C1, C2, C3, C4, Bridge, River, Mountain] • [Figure: clustering without taking obstacles into consideration]

  42. • [Figure: not taking obstacles into account] • [Figure: taking obstacles into account]