Geographic Data Mining

Geographic Data Mining Marc van Kreveld Seminar for GIVE Block 1, 2003/2004

About … • A form of geographical analysis • Current topic of interest in GIS research (and database research and AI research) • Finding hidden information in large collections of geographic data

This seminar • Learning about a topic together • Presenting to each other + interaction • Added value by good examples: • for important concepts, algorithms • possibly devised by yourself, or extended • referring to GIS data and issues (hence the GIS course prerequisite) • Written assignment: joint survey

Material • Book by Harvey Miller and Jiawei Han (editors): selected chapters • Possibly: papers from conference proceedings • Mostly provided by me

Weeks • Week 36-46 • Probably: • Not September 4 (this Thursday) • Not in week 40 (Sept. 29 & Oct. 2) • Not October 23 • The above depending on participation!

Overview of Geographic Data Mining & Knowledge Discovery (Chapter 1 of the book) • KDD: knowledge discovery in databases • Data warehouses • Data mining • Geographic aspects of the above

Knowledge Discovery in Databases (KDD) • Large databases contain interesting patterns: non-random properties and relationships that are: • valid (general enough to apply to new data) • novel (non-trivial and unexpected) • useful (leads to effective action: decision making or investigation) • ultimately understandable (simple, and interpretable by humans)

Knowledge Discovery in Databases (KDD) • Because of quantity of data nowadays • Because we want information, not data • Because computing power allows it

KDD opposed to statistics • Statistics • small and clean numeric database • scientifically sampled • specific questions in mind • KDD: none of the above

KDD techniques • Statistics • Machine learning • Pattern recognition • Numeric search (?) • Scientific visualization

Data warehouse • Large repository of data • For analytical processing (DB: transactional processing) • Heterogeneous: different sources and formats (DB: homogeneous) • Supports OLAP tools (OnLine Analytical Processing)

OLAP Example • Measure of interest: sales • Dimensions of interest: item, store, week • (item, store, week) → money [quantity sold times price]

OLAP Example • 2-dim. aggregation: (item, store, ·) → money • Another 2-dim. aggregation: sales by store and by week • 1-dim. aggregation: sales by week (all items and stores) • Data cube: all 2^d possible aggregations (d = number of dimensions), different types of summaries
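The aggregations on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SALES` rows, the helper names `aggregate` and `data_cube`, and all values are made up and do not come from the book. The sketch builds one summary per subset of the d = 3 dimensions, which gives 2^3 = 8 aggregations.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: (item, store, week, money). All values are made up.
SALES = [
    ("apples", "store_A", 36, 10.0),
    ("apples", "store_B", 36, 4.0),
    ("pears", "store_A", 36, 6.0),
    ("apples", "store_A", 37, 8.0),
    ("pears", "store_B", 37, 2.0),
]
DIMENSIONS = ("item", "store", "week")


def aggregate(rows, keep):
    """Sum money over every dimension that is not listed in `keep`."""
    idx = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[i] for i in idx)
        totals[key] += row[-1]
    return dict(totals)


def data_cube(rows):
    """Return one aggregation per subset of the dimensions (2^d in total)."""
    cube = {}
    for k in range(len(DIMENSIONS) + 1):
        for keep in combinations(DIMENSIONS, k):
            cube[keep] = aggregate(rows, keep)
    return cube


cube = data_cube(SALES)
sales_by_week = cube[("week",)]                # 1-dim. aggregation
sales_by_item_store = cube[("item", "store")]  # 2-dim. aggregation
```

Real OLAP tools precompute or index such summaries; the sketch recomputes each one from the raw rows for clarity.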

KDD steps • Data selection • Data pre-processing • Data enrichment • Data reduction and projection • Data mining • Interpretation and reporting • Note: the presence and order of the steps are not fixed

KDD steps • Data selection: which records, variables chosen?

KDD steps • Data selection • Data pre-processing: removing noise, duplicate records, handling missing data, …

KDD steps • Data selection • Data pre-processing • Data enrichment: combining the selected data with external data

KDD steps • Data selection • Data pre-processing • Data enrichment • Data reduction and projection: reduction in number, reducing dimension

KDD steps • Data selection • Data pre-processing • Data enrichment • Data reduction and projection • Data mining: uncovering information, interesting patterns

KDD steps • Data selection • Data pre-processing • Data enrichment • Data reduction and projection • Data mining • Interpretation and reporting: evaluating, understanding, communicating
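The six steps can be read as a small pipeline. The sketch below is a toy illustration, not a method from the book: the rainfall records, the `REGION` lookup table and every function name are hypothetical, and the "mining" step is only a per-region mean.

```python
# Hypothetical raw records: (station, year, rainfall_mm); None marks missing data.
RAW = [
    ("A", 2002, 810.0),
    ("A", 2002, 810.0),   # duplicate record
    ("B", 2002, None),    # missing value
    ("B", 2003, 640.0),
    ("C", 2003, 900.0),
    ("C", 1990, 700.0),
]
# Hypothetical external data used for enrichment: station -> region.
REGION = {"A": "coast", "B": "inland", "C": "coast"}


def select(rows):
    """Data selection: keep only records from the year 2000 onwards."""
    return [r for r in rows if r[1] >= 2000]


def preprocess(rows):
    """Pre-processing: drop records with missing values and exact duplicates."""
    seen, clean = set(), []
    for r in rows:
        if r[2] is not None and r not in seen:
            seen.add(r)
            clean.append(r)
    return clean


def enrich(rows):
    """Enrichment: attach the region taken from the external table."""
    return [r + (REGION[r[0]],) for r in rows]


def reduce_and_project(rows):
    """Reduction and projection: keep only (region, rainfall)."""
    return [(r[3], r[2]) for r in rows]


def mine(rows):
    """Data mining (here only a summary): mean rainfall per region."""
    sums = {}
    for region, rain in rows:
        total, count = sums.get(region, (0.0, 0))
        sums[region] = (total + rain, count + 1)
    return {region: total / count for region, (total, count) in sums.items()}


result = mine(reduce_and_project(enrich(preprocess(select(RAW)))))
for region, mean_rain in sorted(result.items()):  # interpretation and reporting
    print(f"{region}: {mean_rain:.1f} mm")
```

Because the presence and order of the steps are not fixed, each step is a separate function that can be dropped or reordered.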

Data mining • Segmentation • Dependency analysis • Deviation and outlier analysis • Trend detection • Generalization and characterization

DM - segmentation • Description: • Clustering: finding a finite set of implicit classes • Classification: mapping data items into pre-defined classes • Techniques: • Cluster analysis • Bayesian classification • Decision or classification trees • Artificial neural networks

DM - segmentation • [Figure: clustering (implicit classes) versus classification (given classes)]
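The difference between the two forms of segmentation can be sketched as follows. The techniques used here are swapped in for brevity and are assumptions of this sketch: nearest-centroid classification stands in for the classifiers listed on the slide, and a basic k-means loop stands in for cluster analysis. The points and function names are made up.

```python
import math


def centroid(points):
    """Mean position of a non-empty list of 2D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def nearest(point, centers):
    """Index of the centre closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def classify(point, labelled):
    """Classification: the classes are given; pick the class with the nearest centroid."""
    names = sorted(labelled)
    centers = [centroid(labelled[name]) for name in names]
    return names[nearest(point, centers)]


def k_means(points, centers, rounds=10):
    """Clustering: the classes are implicit; alternate assignment and centroid update."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the old centre if a group happens to be empty.
        centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Made-up sample points forming two obvious groups.
POINTS = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, groups = k_means(POINTS, centers=[(0, 0), (10, 10)])
label = classify((9, 9), {"low": POINTS[:3], "high": POINTS[3:]})
```

Clustering receives only the points and discovers the groups; classification receives the groups and assigns a new point to one of them.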

DM – dependency analysis • Description: finding rules to predict the value of some attribute based on other attributes • Techniques: • Bayesian networks • Association rules • Example tuples, with missing values to predict: (4, 12, 0.24) (3, 14, 0.21) (7, 13, 0.43) (2, 9, 0.78) (11, 11, 0.55) (5, 11, ???) (???, 12, 0.51)

DM – dependency analysis • Confidence and support measures for association rules of the form [ if X then Y ]: • confidence = #(X and Y in DB) / #(X in DB) • support = #(X and Y in DB) / #(all in DB)
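The two measures translate directly into code. The following is a minimal sketch, assuming the database is a list of item sets; the function name `support_and_confidence` and the sample transactions are made up for illustration.

```python
def support_and_confidence(transactions, x, y):
    """Return (support, confidence) of the rule `if x then y`."""
    x, y = set(x), set(y)
    n_all = len(transactions)
    n_x = sum(1 for t in transactions if x <= t)          # #(X in DB)
    n_xy = sum(1 for t in transactions if (x | y) <= t)   # #(X and Y in DB)
    support = n_xy / n_all
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence


# Made-up database of four transactions.
DB = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]
support, confidence = support_and_confidence(DB, {"bread"}, {"butter"})
```

For the rule [ if bread then butter ], two of the four transactions contain both items (support 2/4) and two of the three bread transactions also contain butter (confidence 2/3).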

DM – deviation & outlier analysis • Description: finding data with unusual deviations (= errors, or data of particular interest) • Techniques: • Clustering, other mining methods • Outlier analysis
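One simple way to flag unusual deviations is a z-score test. This is an assumption of the sketch rather than a technique named on the slide; the threshold of 2 standard deviations and the readings are made up.

```python
from statistics import mean, pstdev


def outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Made-up measurements with one suspicious value.
readings = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 25.0]
flagged = outliers(readings)
```

Whether a flagged value is an error or a record of particular interest still has to be decided by a person.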

DM – trend detection • Description: finding lines, curves, summarizing the database (often as a function over time) • Techniques: • Regression • Sequential pattern extraction
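Regression, the first technique listed, can be sketched as an ordinary least-squares line fit. The function name `fit_line` and the yearly values are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up yearly measurements with a clear upward trend.
years = [2000, 2001, 2002, 2003]
values = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(years, values)
forecast_2004 = slope * 2004 + intercept
```

The fitted slope summarizes the whole series in one number, which is the sense in which a trend "summarizes the database".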

DM – generalization and characterization • Description: obtaining compact descriptions of the data • Techniques: • Summary rules • Attribute-oriented induction • [Figure: concept hierarchy, from low-level concepts up to higher-level concepts]
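The idea behind attribute-oriented induction can be sketched as climbing a concept hierarchy and counting the generalized values. The hierarchy, the parcel data and the function names below are hypothetical, and the sketch leaves out the thresholds that a full induction procedure would use to decide how far to generalize.

```python
from collections import Counter

# Hypothetical concept hierarchy: low-level concept -> higher-level concept.
HIERARCHY = {
    "oak": "deciduous forest",
    "beech": "deciduous forest",
    "pine": "coniferous forest",
    "spruce": "coniferous forest",
    "deciduous forest": "forest",
    "coniferous forest": "forest",
}


def generalize(value, levels=1):
    """Replace a concept by its ancestor `levels` steps up the hierarchy."""
    for _ in range(levels):
        value = HIERARCHY.get(value, value)  # top-level concepts stay unchanged
    return value


def summarize(values, levels=1):
    """Compact description: count the generalized concepts."""
    return Counter(generalize(v, levels) for v in values)


# Made-up land-cover attribute of five parcels.
parcels = ["oak", "pine", "beech", "oak", "spruce"]
one_level = summarize(parcels, levels=1)
two_levels = summarize(parcels, levels=2)
```

Each step up the hierarchy gives a shorter but coarser description of the same data.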

Visualization and knowledge discovery • KDD is difficult to automate → steered by human intelligence • Visualization helps to understand the data and which data mining techniques to try

KD + geography • Special case of KDD • Other special cases • marketing • biology • astronomy • Main features: location, distance, dimensionality (with dependent dimensions)

KD + geography • Statistics: (attr1, attr2, attr3, attr4); attributes are numbers and (relatively) independent • KDD: (attr1, attr2, attr3, attr4); attributes can also be on other measurement scales • KD + geography: (attr1, attr2, attr3, attr4); attributes are often dependent and can be shapes • Often: (lat., long., attr1, attr2, …) or: (shape description, attr1, attr2, …)

KD + geography • Study of scalable versions of DM tasks (in lat. and long.) • Certain dimensions can be non-metric (travel time need not be symmetric) • DM in data that is not in the form of tuples: sets of thematic map layers

Geographic data mining • Spatial segmentation (clustering, classification) • Spatial dependency (spatial association rules) • Spatial trend detection • Geographic characterization and generalization

GDM – spatial association rules • Example: if a location is within 500 m from water [distance relationship] and the average winter temperature is at least –2 degrees, then there are frogs around
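The frog rule can be evaluated against sample data with the same confidence and support measures used for ordinary association rules. Everything in this sketch is assumed for illustration: the sample locations, the planar coordinates in metres, and the reduction of the water body to a single point.

```python
import math

# Hypothetical sample locations: (x, y) in metres, average winter
# temperature in degrees Celsius, and whether frogs were observed.
LOCATIONS = [
    ((100.0, 0.0), 1.0, True),
    ((300.0, 400.0), -1.0, True),
    ((450.0, 0.0), 0.5, False),
    ((800.0, 0.0), 2.0, False),
    ((200.0, 0.0), -5.0, False),
]
WATER_POINTS = [(0.0, 0.0)]  # hypothetical water body, reduced to one point


def near_water(xy, max_dist=500.0):
    """Distance relationship: is the location within `max_dist` metres of water?"""
    return any(math.dist(xy, w) <= max_dist for w in WATER_POINTS)


def antecedent(xy, temp):
    """The 'if' part of the rule: near water and winter temperature at least -2."""
    return near_water(xy) and temp >= -2.0


# Frog observations at the locations that satisfy the 'if' part.
matches = [frogs for xy, temp, frogs in LOCATIONS if antecedent(xy, temp)]
support = sum(matches) / len(LOCATIONS)
confidence = sum(matches) / len(matches)
```

The only spatial ingredient is the distance predicate; with real map layers it would be a point-to-polygon distance rather than a point-to-point one.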

GDM – spatial trend detection • Patterns of change with respect to neighborhood of some object • Example: (North America) Further from Pacific ocean → fewer earthquakes

GDM - applications • Map interpretation • Remote sensing interpretation • Environmental mapping (soil type, etc.) • Extracting spatio-temporal patterns for cyclones, crimes • Spatial interaction (movement/flow of people, capital, goods)

Conclusions • GDM & GKD are an extension of (tool for) geographical analysis • GDM is different from DM due to • Geographic spaces, not attribute space • Neighborhood is extremely important • Scale issues • Data is different • Applications (interesting patterns to mine for) are different

This seminar on GDM • First: chapters from the book • CH 1: GDM & KD: an overview (today) • CH 2: Paradigms for spatial and spatio-temporal DM (11-9) • CH 3: Fundamentals of spatial DW for GKD (15-9) • CH 7: Algorithms and applications of SDM (Ronny) (18-9) • CH 8: Spatial clustering in DM (22-9) • CH 6: Modeling spatial dependencies (25-9) (not: 29-9 and 2-10) • CH 9: Detecting outliers (6-10) • CH 10: Knowledge construction based on GVis and KDD • CH 14: Mining mobile trajectories

This seminar • All PowerPoint presentations on the Web page of the course • Survey paper or written exam; possible topics for survey: • Hierarchical clustering • Clustering with obstacles • Proximity relationship mining • … • Or: joint survey of (geometric) algorithms for GDM

Each presentation • The chapter contents • Additional (spatial) examples (from the Web links or self-constructed) • Detect and present algorithmic problems that appear → together: report on algorithmic issues in GDM • Present your chapter; don’t be afraid of overlap with other chapters
