Introduction to Spatial Data Management: Concepts, Techniques, and Trends

Agenda Today • We will discuss a few interesting spatial data mining patterns • Then come back to summarize what we have learned in this course so far

Spatial Data Management: Summary

Course Summary • 1. Introduction to Spatial Databases • 2. Spatial Concepts and Data Models • 3. Spatial Query Languages: SQL3 • 4. Spatial Storage and Indexing: R-tree, Grid File • 5. Query Processing and Query Optimization • Strategies for range query, nearest neighbor query • Spatial joins (e.g. tree matching), cost models • 6. Spatial Network Model • 7. Spatial Data Mining • Spatial auto-correlation, co-location patterns, spatial outliers, classification methods • 8. Trends in Spatial Databases (Moving Objects)

1. Introduction • Traditional (non-spatial) database management systems provide: • Persistence across failures • Concurrent access to data • Scalability to search queries on very large datasets that do not fit in main memory • Efficient processing of non-spatial queries, but not of spatial queries • Non-spatial queries: • List the names of all bookstores with more than ten thousand titles. • List the names of the top ten customers, in terms of sales, in the year 2001. • Use an index to narrow down the search • Spatial queries: • List the names of all bookstores within ten miles of Minneapolis. • List all customers who live in Tennessee and its adjoining states. • List all the customers who reside within fifty miles of the company headquarters.

1. Spatial Data Examples • Examples of non-spatial data • Names, phone numbers, … • Examples of Spatial data • Census Data • NASA satellites imagery - terabytes of data per day • Weather and Climate Data • Rivers, Farms, ecological impact • Medical Imaging

2. Spatial Object Model • Object model concepts • Objects: distinct identifiable things relevant to an application • Objects have attributes and operations • Attribute: a simple (e.g. numeric, string) property of an object • Operations: functions that map object attributes to other objects • Example from a roadmap • Objects: roads, landmarks, ... • Attributes of road objects: • spatial: location, e.g. polygon boundary of land-parcel • non-spatial: name (e.g. Route 66), type (e.g. interstate, residential street), number of lanes, speed limit, … • Operations on road objects: determine center line, determine length, determine intersection with other roads, ...

2. Classifying Spatial objects • Spatial objects are spatial attributes of general objects • Spatial objects are of many types • Simple • 0-dimensional (points), 1-dimensional (curves), 2-dimensional (surfaces) • Example given at the bottom of this slide • Collections • Polygon collection (e.g. boundary of Japan or Hawaii), … • See more complete list in Figure 2.2

2. Spatial Object Types in OGIS Data Model Fig 2.2: Each rectangle shows a distinct spatial object type

2. Classifying Operations on spatial objects in Object Model • Classifying operations • Set based: 2-dimensional spatial objects (e.g. polygons) are sets of points • A set operation (e.g. intersection) of 2 polygons produces another polygon • Topological operations: Boundary of USA touches boundary of Canada • Directional: New York City is to the east of Chicago • Metric: Chicago is about 700 miles from New York City.

2. Specifying topological operations Fig 2.3: 9-intersection matrices for a few topological operations

2. Conceptual DM: The ER Model • 3 basic concepts • Entities have an independent conceptual or physical existence. • Examples: Forest, Road, Manager, ... • Entities are characterized by Attributes • Example: Forest has attributes of name, elevation, etc. • An Entity interacts with another Entity through relationships. • Roads allow access to Forest interiors. • This relationship may be named “Accesses”

2. ER Diagram for “State-Park” Fig 2.4

Pictorial Enhanced ER Diagram for “State-Park”

2. Mapping ER to Relational • Highlights of translation rules • Entity becomes Relation • Attributes become columns in the relation • Multi-valued attributes become a new relation • includes foreign key to link to relation for the entity • Relationships (1:1, 1:N) become foreign keys • M:N Relationships become a relation • containing foreign keys to the relations of the participating entities

3. Three Components of SQL? • Data Definition Language (DDL) • Creation and modification of relational schema • Schema objects include relations, indexes, etc. • Data Manipulation Language (DML) • Insert, delete, update rows in tables • Query data in tables • Data Control Language (DCL) • Concurrency control, transactions • Administrative tasks, e.g. set up database users, security permissions

3. Creating Tables in SQL • Table definition • “CREATE TABLE” statement • Specifies table name, attribute names and data types • Creates a table with no rows. • See an example at the bottom • Related statements • ALTER TABLE statement modifies table schema if needed • DROP TABLE statement removes a table, together with any rows it contains

3. Populating Tables in SQL • Adding a row to an existing table • “INSERT INTO” statement • Specifies table name, attribute names and values • Example: • INSERT INTO River(Name, Origin, Length) VALUES('Mississippi', 'USA', 6000) • Related statements • SELECT statement with INTO clause can insert multiple rows in a table • Bulk load, import commands also add multiple rows • DELETE statement removes rows • UPDATE statement can change values within selected rows

3. SELECT Statement: General Information • Clauses • SELECT specifies desired columns • FROM specifies relevant tables • WHERE specifies qualifying conditions for rows • ORDER BY specifies sorting columns for results • GROUP BY, HAVING specify aggregation and statistics • Operators and functions • arithmetic operators, e.g. +, -, … • comparison operators, e.g. =, <, >, BETWEEN, LIKE, … • logical operators, e.g. AND, OR, NOT, EXISTS, … • set operators, e.g. UNION, IN, ALL, ANY, … • statistical functions, e.g. SUM, COUNT, ... • many other operators on strings, date, currency, ...
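
The table statements summarized on the last three slides can be tried end to end with a small sketch. It uses Python's built-in sqlite3 module with an in-memory database purely as a convenient sandbox; the column types and the two extra rows (RiverA, RiverB) are made-up assumptions, and only the Mississippi row comes from the slides.

```python
import sqlite3

# In-memory database, used only for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: CREATE TABLE defines the schema and yields a table with no rows.
# The column types are an assumption; the slides do not list them.
cur.execute("CREATE TABLE River (Name TEXT PRIMARY KEY, Origin TEXT, Length INTEGER)")

# DML: INSERT INTO adds a row (this is the example row from the slide).
cur.execute("INSERT INTO River(Name, Origin, Length) VALUES ('Mississippi', 'USA', 6000)")
# Two hypothetical rows so that the query below has something to filter.
cur.executemany(
    "INSERT INTO River(Name, Origin, Length) VALUES (?, ?, ?)",
    [("RiverA", "CountryA", 1200), ("RiverB", "CountryB", 4500)],
)

# Query: SELECT columns, FROM table, WHERE condition, ORDER BY sort key.
cur.execute("SELECT Name, Length FROM River WHERE Length > 2000 ORDER BY Length DESC")
long_rivers = cur.fetchall()

# UPDATE and DELETE act on the rows selected by their WHERE clause.
cur.execute("UPDATE River SET Length = 1300 WHERE Name = 'RiverA'")
cur.execute("DELETE FROM River WHERE Name = 'RiverB'")
cur.execute("SELECT COUNT(*) FROM River")
remaining = cur.fetchone()[0]
conn.close()
```

Here long_rivers holds the rows longer than 2000 in descending order of length, and remaining counts the rows left after the DELETE.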

4. Query Operation & Spatial Index • Filter Step: • Select the objects whose mbb satisfies the spatial predicate • Traverse the index and apply the spatial test on the mbb • Output: set of oids • Refinement Step: • Spatial test is done on the actual geometries of objects whose mbb satisfied the filter step • Costly operation • Executed only on a limited number of objects • Concentrate on the design of efficient SAMs for the filter step
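
The filter and refinement steps can be sketched for a point query over polygons. This is a minimal illustration, not a SAM: the filter scans every object instead of traversing an index, and the function names and sample polygons are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]
Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def mbb(poly: Polygon) -> Box:
    """Minimum bounding box of a polygon."""
    xs = [x for x, _ in poly]
    ys = [y for _, y in poly]
    return min(xs), min(ys), max(xs), max(ys)


def box_contains(box: Box, pt: Point) -> bool:
    xmin, ymin, xmax, ymax = box
    return xmin <= pt[0] <= xmax and ymin <= pt[1] <= ymax


def point_in_polygon(pt: Point, poly: Polygon) -> bool:
    """Exact geometry test by ray casting (boundary points get no special care)."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through pt
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def point_query(pt: Point, objects: Dict[str, Polygon]):
    # Filter step: cheap test on the mbb only; output is a set of candidate oids.
    candidates = [oid for oid, poly in objects.items() if box_contains(mbb(poly), pt)]
    # Refinement step: costly exact test, run only on the candidates.
    answers = [oid for oid in candidates if point_in_polygon(pt, objects[oid])]
    return candidates, answers


objects = {
    "triangle": [(0, 0), (4, 0), (0, 4)],
    "square": [(2, 2), (5, 2), (5, 5), (2, 5)],
    "far": [(10, 10), (12, 10), (12, 12), (10, 12)],
}
candidates, answers = point_query((3, 3), objects)
```

The triangle survives the filter step because (3, 3) lies inside its mbb, but it is discarded by refinement; this false hit is exactly what the refinement step exists to remove.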

4. Why spatial index method? • B-tree & hash tables • Guarantee that the number of I/O operations is respectively logarithmic and constant in the collection size • Index a collection on a key • Rely on a total order on the key domain, e.g. the order of natural numbers, or the lexicographic order on strings • There is no such total order for geometric objects • SAMs were designed to try as much as possible to preserve spatial object proximity

4. Space-Driven vs. Data-Driven SAMs • Space-Driven structures: • Partition the embedding 2D Space into rectangular cells • Independently of the distribution of the objects • Objects are mapped to the cells based on some geometric criterion • Grid file, linear structure • Data-Driven structures: • Organized by partitioning the set of objects, as opposed to the embedding space • Adapts to the objects’ distribution in the embedding space • R-tree, R* tree, R+ tree
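
A space-driven structure in its simplest form is a fixed grid: the cell of an object depends only on its coordinates, never on where the other objects are. The sketch below assumes point data and an arbitrary cell size; it is not the grid file itself, whose cells adapt to the data, and all names are hypothetical.

```python
from collections import defaultdict
from math import floor

CELL = 10.0  # side length of a cell; an arbitrary choice for this sketch


def cell_of(x, y):
    """Geometric criterion that maps a point to its cell."""
    return (floor(x / CELL), floor(y / CELL))


def build_grid(points):
    grid = defaultdict(list)
    for pid, (x, y) in points.items():
        grid[cell_of(x, y)].append(pid)
    return grid


def range_query(grid, points, xmin, ymin, xmax, ymax):
    """Visit only the cells overlapping the query window, then test each point."""
    cx0, cy0 = cell_of(xmin, ymin)
    cx1, cy1 = cell_of(xmax, ymax)
    result = []
    for cx in range(cx0, cx1 + 1):
        for cy in range(cy0, cy1 + 1):
            for pid in grid.get((cx, cy), []):
                x, y = points[pid]
                if xmin <= x <= xmax and ymin <= y <= ymax:
                    result.append(pid)
    return sorted(result)


points = {"a": (1, 1), "b": (12, 3), "c": (25, 25), "d": (9, 9)}
grid = build_grid(points)
hits = range_query(grid, points, 0, 0, 13, 10)
```

A data-driven structure such as the R-tree would instead group the points themselves, so dense regions get more, smaller groups.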

4. Grid File – point indexing • One page is associated with each cell • When a cell overflows, it is split into two cells and its points are redistributed between them • Two adjacent cells can reference the same page • The cells are of different sizes and the partition adapts to the point distribution

4. The Quad tree • The index is represented as a quaternary tree • Each internal node has four children, one per quadrant • NW, NE, SW, SE • Each leaf is associated with a disk page, which stores the index entries
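
A minimal in-memory sketch of the quaternary structure follows. It indexes points only, and the capacity parameter stands in for the number of entries that fit on a leaf's disk page; both are assumptions of this sketch, as are the class and method names.

```python
class QuadTree:
    """Point quadtree sketch: a leaf splits into NW, NE, SW, SE when it overflows.

    Note: inserting more than `capacity` identical points would split forever;
    a real implementation needs a depth limit or overflow pages.
    """

    def __init__(self, x0, y0, x1, y1, capacity=2):
        self.bounds = (x0, y0, x1, y1)
        self.capacity = capacity
        self.points = []      # entries stored in a leaf (its "page")
        self.children = None  # [NW, NE, SW, SE] once the node is internal

    def _child_for(self, p):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        north = p[1] >= my
        east = p[0] >= mx
        return self.children[(0 if north else 2) + (1 if east else 0)]

    def _split(self):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        self.children = [
            QuadTree(x0, my, mx, y1, self.capacity),  # NW
            QuadTree(mx, my, x1, y1, self.capacity),  # NE
            QuadTree(x0, y0, mx, my, self.capacity),  # SW
            QuadTree(mx, y0, x1, my, self.capacity),  # SE
        ]
        pts, self.points = self.points, []
        for p in pts:
            self._child_for(p).insert(p)

    def insert(self, p):
        if self.children is not None:
            self._child_for(p).insert(p)
            return
        self.points.append(p)
        if len(self.points) > self.capacity:
            self._split()

    def leaf_for(self, p):
        """Follow one quadrant per level down to the leaf covering p."""
        node = self
        while node.children is not None:
            node = node._child_for(p)
        return node


tree = QuadTree(0, 0, 8, 8, capacity=2)
for pt in [(1, 1), (7, 7), (1, 7), (6, 1)]:
    tree.insert(pt)
```

After the third insertion the root overflows and splits, so each of the four points ends up alone in its own quadrant.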

4. The original R-Tree • A leaf entry is a pair (mbb, oid) • A non-leaf node contains an array of node entries • The number of entries is between m and M • For each entry (dr, node_id) in a non-leaf node N, dr is the directory rectangle of a child node of N, whose page address is node_id • All leaves are at the same level • An object appears in one, and only one of the tree leaves
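
The node layout described above is enough to sketch a window search. The tree below is built by hand rather than by the R-tree insertion algorithm, the bounds m and M are ignored, and the Node class and sample rectangles are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class Node:
    is_leaf: bool
    # Leaf entries are (mbb, oid); non-leaf entries are (dr, child node).
    entries: list = field(default_factory=list)


def search(node: Node, window: Rect) -> List[str]:
    """Return the oids whose mbb intersects the window (filter step only)."""
    hits = []
    for rect, ref in node.entries:
        if intersects(rect, window):
            if node.is_leaf:
                hits.append(ref)
            else:
                hits.extend(search(ref, window))  # descend only into matching subtrees
    return hits


# Hand-built two-level tree; each directory rectangle bounds its child's entries.
leaf1 = Node(True, [((0, 0, 2, 2), "a"), ((1, 3, 3, 5), "b")])
leaf2 = Node(True, [((6, 6, 8, 8), "c"), ((7, 1, 9, 3), "d")])
root = Node(False, [((0, 0, 3, 5), leaf1), ((6, 1, 9, 8), leaf2)])

left_hits = search(root, (1, 1, 4, 4))
right_hits = search(root, (7, 2, 7.5, 7))
```

For the first window the search never reads leaf2, because its directory rectangle does not intersect the window. In the original R-tree, directory rectangles may overlap, so a search can follow several paths.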

4. The R+ Tree • The directory rectangles at a given level do not overlap • For a point query, a single path is followed from the root to a leaf • The I/O complexity is bounded by the depth of the tree

5. What is Query Processing and Optimization (QPO)? • Basic idea of QPO • In SQL, queries are expressed in high level declarative form • QPO translates a SQL query to an execution plan • over physical data model • using operations on file structures, indices, etc. • Ideal execution plan answers Q in as little time as possible • Constraints: QPO overheads are small • Computation time for QPO steps << that for execution plan

5. QPO Challenges in SDBMS • Building Blocks for spatial queries • Rich set of spatial data types, operations • A consensus on “building blocks” is lacking • Current choices include spatial select, spatial join, nearest neighbor • Choice of strategies • Limited choice for some building blocks, e.g. nearest neighbor • Choosing best strategies • Cost models are more complex since • Spatial Queries are both CPU and I/O intensive • While traditional queries are I/O intensive • Cost models of spatial strategies are not mature.

5. Choice of building blocks • Choice of building blocks • Varies across software vendors and products • List of representative building blocks • Point Query: Name a highlighted city on a digital map. • Return one spatial object out of a table • Range Query: List all countries crossed by the river Amazon. • Returns several objects within a spatial region from a table • Spatial Join: List all pairs of overlapping rivers and countries. • Return pairs from 2 tables satisfying a spatial predicate • Nearest Neighbor: Find the city closest to Mount Everest. • Return one spatial object from a collection

5. Strategies for Spatial Joins • Recall Spatial Join Example: • List all pairs of overlapping rivers and countries. • Return pairs from 2 tables satisfying a spatial predicate • List of strategies • Nested loop: • Test all possible pairs for spatial predicate • All rivers are paired with all countries • Space Partitioning: • Test pairs of objects from common spatial regions only • Rivers in Africa are tested with countries in Africa only! • Tree Matching • Hierarchical pairing of object groups from each table, section 5.1.6 pp.121 • Other, e.g. spatial-join-index based, external plane-sweep, …
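
The first two strategies can be compared on rectangles. This sketch assumes objects are represented by their bounding boxes, so it covers the filter step of the join only; the grid cell size, function names, and sample rectangles are hypothetical.

```python
from collections import defaultdict
from math import floor


def intersects(a, b):
    """Overlap test for boxes given as (xmin, ymin, xmax, ymax)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def nested_loop_join(R, S):
    """Test all possible pairs."""
    return sorted((r, s) for r, rb in R.items() for s, sb in S.items() if intersects(rb, sb))


def cells(box, size):
    """All grid cells that a box overlaps."""
    x0, y0, x1, y1 = box
    for cx in range(floor(x0 / size), floor(x1 / size) + 1):
        for cy in range(floor(y0 / size), floor(y1 / size) + 1):
            yield (cx, cy)


def partition_join(R, S, size=10.0):
    """Test only pairs of objects that fall into a common grid cell."""
    buckets = defaultdict(lambda: ([], []))
    for r, box in R.items():
        for c in cells(box, size):
            buckets[c][0].append(r)
    for s, box in S.items():
        for c in cells(box, size):
            buckets[c][1].append(s)
    pairs = set()  # a set drops duplicates from objects that span several cells
    for rs, ss in buckets.values():
        for r in rs:
            for s in ss:
                if intersects(R[r], S[s]):
                    pairs.add((r, s))
    return sorted(pairs)


rivers = {"r1": (0, 0, 25, 2), "r2": (30, 30, 35, 38)}
countries = {"c1": (5, 0, 15, 12), "c2": (20, 1, 28, 9), "c3": (32, 20, 40, 31)}
by_nested_loop = nested_loop_join(rivers, countries)
by_partition = partition_join(rivers, countries)
```

Both strategies return the same pairs; they differ in how many pairs are tested, which is what a cost model would try to estimate.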

5. Query Processing and Optimizer process • A sight-seeing trip • Start: A SQL Query • End: An execution plan • Intermediate Stopovers • query trees • logical tree transforms • strategy selection • What happens after the journey? • Execution plan is executed • Query answer returned Fig 5.2

5. Query Trees • Nodes = building blocks of (spatial) queries • See section 3.2 (pp.55) for symbols sigma, pi and join • Children = inputs to a building block • Leaves = Tables • Example SQL query and its query tree follows: Fig 5.3

5. Logical Transformation of Query Trees • Motivation • Transformations do not change the answer of the query • But can reduce computational cost by • reducing data produced by sub-queries • reducing computation needs of parent node • Example Transformation • Push down select operation below join • Example: Fig. 5.4 (compare w/ Fig 5.3, last slide) • Reduces size of table for join operation • Other common transformations • Push project down • Reorder join operations • ... Fig 5.4
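
The effect of pushing selections below a join can be shown by counting join tests. The sketch reuses the predicates of the course's facility and lake example (name equals Campground, area above 20, distance below 50), but the tables are made up and lakes are reduced to points, which is a simplifying assumption.

```python
from math import hypot

# Hypothetical tables; "xy" stands in for the geometry column.
facilities = [
    {"name": "Campground", "xy": (0, 0)},
    {"name": "Marina", "xy": (5, 5)},
    {"name": "Campground", "xy": (200, 200)},
]
lakes = [
    {"name": "L1", "area": 30, "xy": (10, 10)},
    {"name": "L2", "area": 5, "xy": (3, 4)},
    {"name": "L3", "area": 50, "xy": (210, 190)},
]


def near(f, l, d=50):
    return hypot(f["xy"][0] - l["xy"][0], f["xy"][1] - l["xy"][1]) < d


def select_after_join():
    """Join first, then apply both selections to every joined pair."""
    tests, out = 0, []
    for f in facilities:
        for l in lakes:
            tests += 1
            if near(f, l) and f["name"] == "Campground" and l["area"] > 20:
                out.append((f["xy"], l["name"]))
    return out, tests


def select_before_join():
    """Selections pushed below the join: the join sees smaller inputs."""
    fs = [f for f in facilities if f["name"] == "Campground"]
    ls = [l for l in lakes if l["area"] > 20]
    tests, out = 0, []
    for f in fs:
        for l in ls:
            tests += 1
            if near(f, l):
                out.append((f["xy"], l["name"]))
    return out, tests


answer_a, tests_a = select_after_join()
answer_b, tests_b = select_before_join()
```

The two plans give the same answer, while the number of distance tests drops from 9 to 4 on this tiny input.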

5. Execution Plans • An execution plan has 3 components • A query tree • An ordering of evaluation of non-leaf nodes • A strategy selected for each non-leaf node • Example • Strategies for Query tree in Fig. 5.5 • Use scan for Area(L.Geometry) > 20 • Use index for Fa.Name = ‘Campground’ • Use space-partitioning join for • Distance(Fa, L) < 50 • Use on-the-fly for projection • Ordering • As listed above Fig 5.5

7. What is Spatial Data Mining? • Non-trivial search for interesting and unexpected spatial patterns • Non-trivial Search • Large (e.g. exponential) search space of plausible hypotheses • Ex. Asiatic cholera: causes: water, food, air, insects, …; water delivery mechanisms: numerous pumps, rivers, ponds, wells, pipes, ... • Interesting • Useful in certain application domain • Ex. Shutting off identified water pump => saved human lives • Unexpected • Pattern is not common knowledge • May provide a new understanding of world • Ex. Water pump - Cholera connection led to the “germ” theory

7. Choice of Methods • Two Approaches to mining Spatial Data • Pick spatial features; use classical DM methods • Use novel spatial data mining techniques • Possible Approach: • Define the problem: capture special needs • Explore data using maps, other visualization • Try reusing classical DM methods • If classical DM performs poorly, try new methods • Evaluate chosen methods rigorously • Performance tuning as needed

7. Location Prediction as a classification problem • Given: • 1. Spatial framework • 2. Explanatory functions • 3. A dependent class • 4. A family of function mappings • Find: classification model • Objective: maximize classification accuracy • Constraints: spatial autocorrelation exists • Color version of Fig. 7.3, pp. 188 (panels: nest locations, distance to open water, vegetation durability, water depth)

7. Techniques for Location Prediction • Classical methods: • logistic regression, decision trees, Bayesian classifier • assume learning samples are independent of each other • Spatial auto-correlation violates this assumption! • Q? What would a map look like if the properties of a pixel were independent of the properties of other pixels? (see below - Fig. 7.4, pp. 189) • New spatial methods • Spatial auto-regression (SAR) • Markov random field (MRF) based Bayesian classifier

7. Spatial AutoRegression (SAR) • Spatial Autoregression Model (SAR) • $y = \rho W y + X\beta + \epsilon$ • $W$ models neighborhood relationships • $\rho$ models strength of spatial dependencies • $\epsilon$ is the error vector • Solutions • $\rho$ and $\beta$ can be estimated using ML or Bayesian statistics • e.g., spatial econometrics package uses Bayesian approach using sampling-based Markov Chain Monte Carlo (MCMC) method. • Likelihood-based estimation requires $O(n^3)$ operations. • Other alternatives: divide and conquer, sparse matrix, LU decomposition, etc.
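
As a short worked step, the SAR model can be solved for $y$. The notation here is the standard one and is stated as an assumption: $\rho$ is the spatial dependence strength, $\beta$ the regression coefficients, $\epsilon$ the error vector, $W$ the $n \times n$ neighborhood matrix, and $I - \rho W$ is assumed to be invertible.

```latex
\begin{aligned}
y &= \rho W y + X\beta + \epsilon \\
(I - \rho W)\, y &= X\beta + \epsilon \\
y &= (I - \rho W)^{-1} (X\beta + \epsilon)
\end{aligned}
```

With $\rho = 0$ this reduces to classical linear regression, $y = X\beta + \epsilon$, where the samples are treated as independent. For $\rho \neq 0$, estimation has to work with the $n \times n$ matrix $I - \rho W$, which is one source of the cubic cost of likelihood-based estimation and the reason sparse-matrix techniques are attractive.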

7. Associations, Spatial associations, Co-location

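7. Association Rules: Formal Definitions • Consider a set of items, $I = \{i_1, \ldots, i_k\}$ • Consider a set of transactions $T = \{t_1, \ldots, t_n\}$ • where each $t_i$ is a subset of $I$. • Support of $C \subseteq I$: $\sigma(C) = |\{t \mid t \in T, C \subseteq t\}|$ • Then $X \rightarrow Y$, for item sets $X, Y \subseteq I$, iff • Support: $X \cup Y$ occurs in at least $s$ percent of the transactions: $\sigma(X \cup Y) / |T| \ge s\%$ • Confidence: at least $c\%$ of the transactions containing $X$ also contain $Y$: $\sigma(X \cup Y) / \sigma(X) \ge c\%$ • Example: Table 7.4 (pp. 202) using data in Section 7.4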
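
Support and confidence are easy to compute directly from their standard definitions. The sketch below uses a made-up set of four market-basket transactions (not the data of Table 7.4), and the function names are hypothetical.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain itemset."""
    return support_count(itemset, transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Fraction of the transactions containing lhs that also contain rhs."""
    return support_count(lhs | rhs, transactions) / support_count(lhs, transactions)


# Hypothetical transactions, each a subset of the item set I.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
]

sup = support({"diapers", "beer"}, transactions)
conf_diapers_beer = confidence({"diapers"}, {"beer"}, transactions)
conf_bread_milk = confidence({"bread"}, {"milk"}, transactions)
```

On this data the rule diapers → beer has support 3/4 and confidence 3/3, while bread → milk has confidence 2/3, so a confidence threshold of 80% would keep only the first rule.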

7. Co-location rules vs. association rules • Comparison (association rules / co-location rules) • Underlying space: discrete sets / continuous space • Item-types: item-types / events or Boolean spatial features • Collection: transaction (T) / neighborhood (N) • Prevalence measure: support / participation index • Conditional probability metric: Pr.[ A in T | B in T ] / Pr.[ A in N(L) | B at location L ] • Participation index $= \min_i \{pr(f_i, c)\}$ • where $pr(f_i, c)$, the participation ratio of feature $f_i$ in co-location $c = \{f_1, f_2, \ldots, f_k\}$, is the fraction of instances of $f_i$ with features $\{f_1, \ldots, f_{i-1}, f_{i+1}, \ldots, f_k\}$ nearby • N(L) = neighborhood of location L
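
For a co-location of just two features, the participation index can be computed with a direct neighbor test. The sketch assumes point instances and a Euclidean distance threshold d as the definition of "nearby"; larger co-locations would need the instances to be mutually close, which this sketch does not handle. The names and coordinates are hypothetical.

```python
from math import hypot


def participation_ratio(feature, other, instances, d):
    """Fraction of instances of `feature` that have an instance of `other` within d."""
    mine, theirs = instances[feature], instances[other]
    hits = sum(
        1 for p in mine
        if any(hypot(p[0] - q[0], p[1] - q[1]) <= d for q in theirs)
    )
    return hits / len(mine)


def participation_index(f1, f2, instances, d):
    """Minimum of the participation ratios of the two features."""
    return min(
        participation_ratio(f1, f2, instances, d),
        participation_ratio(f2, f1, instances, d),
    )


# Hypothetical instances of two spatial features.
instances = {
    "A": [(0, 0), (10, 0), (50, 50)],
    "B": [(1, 0), (9, 1)],
}
pr_a = participation_ratio("A", "B", instances, d=2)
pr_b = participation_ratio("B", "A", instances, d=2)
pi_ab = participation_index("A", "B", instances, d=2)
```

Two of the three A instances have a B nearby, and both B instances have an A nearby, so the participation index is the smaller ratio, 2/3. Taking the minimum keeps a feature with many isolated instances from looking strongly co-located.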

7. Spatial Outlier Detection • Compute $Z(S(x)) = |(S(x) - \mu_s) / \sigma_s|$, where $S(x) = f(x) - E_{y \in N(x)}(f(y))$ • Select points (e.g. $x$ with $Z(S(x))$ above 3)

7. Spatial Outlier Detection: Example (color version of Fig. 7.19, pp. 219) • Given • A spatial graph $G = \{V, E\}$ • A neighbor relationship ($K$ neighbors) • An attribute function $f: V \rightarrow R$ • Find • $O = \{v_i \mid v_i \in V, v_i \text{ is a spatial outlier}\}$ • Spatial Outlier Detection Test • 1. Choice of spatial statistic: $S(x) = [f(x) - E_{y \in N(x)}(f(y))]$ • 2. Test for outlier detection: $|(S(x) - \mu_s) / \sigma_s| > \theta$, where $\mu_s$ and $\sigma_s$ are the mean and standard deviation of $S(x)$ • Rationale • Theorem: $S(x)$ is normally distributed if $f(x)$ is normally distributed
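
The two-step test can be sketched on a small graph: S(x) is the attribute value minus the average over the neighbors, and a node is flagged when the standardized S(x) exceeds a threshold. The eight-node ring, its values, and the function name are hypothetical. The threshold is set to 2 rather than 3 because with only eight nodes a z-score cannot reach 3.

```python
from statistics import mean, pstdev


def spatial_outliers(f, neighbors, theta):
    """Flag nodes whose S(x) = f(x) - mean of neighbor values is extreme."""
    s = {v: f[v] - mean(f[u] for u in neighbors[v]) for v in f}
    mu, sigma = mean(s.values()), pstdev(s.values())
    z = {v: abs((s[v] - mu) / sigma) for v in s}
    return [v for v in f if z[v] > theta], s


# Hypothetical ring of 8 nodes: each node's neighbors are the previous and next node.
values = [10, 11, 10, 12, 50, 11, 10, 12]
f = dict(enumerate(values))
neighbors = {v: [(v - 1) % 8, (v + 1) % 8] for v in f}

outliers, s = spatial_outliers(f, neighbors, theta=2)
```

Node 4 is the only one flagged. Its neighbors 3 and 5 also get large negative S(x) values because node 4 pulls their neighborhood average up, but their standardized scores stay below the threshold.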

8. Spatiotemporal Data • Two types of problems: • Indexing the current positions and movements of objects and querying their anticipated future positions. • Indexing and querying the past movements of mobile objects. • On Indexing Mobile Objects • Indexing the Positions of Continuously Moving Objects

Spatiotemporal Data (cont’d) • Indexing current/future locations of mobile objects • The TPR-tree • Like the R-tree, but the MBRs are time-parameterized to conservative bounding intervals (CBI). • How are the CBI computed? What is the best way to group objects into a CBI? • By minimizing an objective function (e.g., overlap) over the time the TPR-tree is valid. • How do we answer queries using the TPR-tree?
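
The idea of a time-parameterized, conservative bounding interval can be sketched in one dimension. Each object is assumed to move linearly, with a position at a reference time and a constant velocity; the lower bound then moves with the smallest velocity and the upper bound with the largest. The function names and sample objects are hypothetical, and this is a sketch of the bounding idea only, not of the TPR-tree itself.

```python
def conservative_interval(objects, t_ref):
    """objects: list of (position at t_ref, velocity). Returns bounds(t) for t >= t_ref."""
    lo = min(x for x, _ in objects)
    hi = max(x for x, _ in objects)
    v_lo = min(v for _, v in objects)
    v_hi = max(v for _, v in objects)

    def bounds(t):
        dt = t - t_ref
        # Conservative: never tighter than the true extent of the moving objects.
        return lo + v_lo * dt, hi + v_hi * dt

    return bounds


def position(obj, t_ref, t):
    x, v = obj
    return x + v * (t - t_ref)


def may_intersect(bounds, t, q_lo, q_hi):
    """Timeslice query test: can any object be inside [q_lo, q_hi] at time t?"""
    b_lo, b_hi = bounds(t)
    return b_lo <= q_hi and q_lo <= b_hi


objects = [(0, 1), (5, -1), (10, 2)]
bounds = conservative_interval(objects, t_ref=0)
at_start = bounds(0)
at_three = bounds(3)
```

At time 3 the objects are at 3, 2 and 16, while the interval is (-3, 16): it still contains every object but is no longer minimal. This growth over time is why the grouping is chosen by minimizing an objective over the period for which the tree is valid.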

Conclusion • Good progress… still more work is needed: • Devising clean and complete semantics for data models and operators for spatial data, spatial-temporal data • Efficient implementation • Indexing, query processing, query optimization, cost model • Develop efficient algorithms to mine spatial data • Alternative architectures • spatial-temporal data, moving objects • mobile, wireless applications • web GIS
