610 likes | 717 Vues
ICS 214A: Database Management Systems Winter 2004. Lecture 08: Multi-dimensional Indexing. Motivation. Many applications of databases are geographical (2-d) data. Others involve large number of dimensions Examples: location of restaurants in a city.
E N D
ICS 214A: Database Management Systems Winter 2004 Lecture 08: Multi-dimensional Indexing
Motivation • Many applications of databases are geographical (2-d) data. Others involve large number of dimensions • Examples: • location of restaurants in a city. • Map data: zones, county lines, rivers, lakes, etc. (Data has spatial extent) • Sales information described by store, day, item, color, size, etc. Sale = point in multidimensional space. • Student described by age, zipcode, marital status. • Queries: • Point queries • Range Query: “find all McDonald restaurant within a given region”. • Nearest Neighbor Query: Find the nearest McDonald to my house • partial match queries • Spatial join (“all pairs” queries) Notes 08
Types of Queries Point Query Range Query NN Query Spatial Join Query Notes 08
Other Applications with Multi-Dimensional Data • Mechanical CAD • VLSI • Bio Medical Imaging • OLAP • Multimedia data • … Notes 08
Example Example: Find employee records where DEPT = “Toy” AND SAL > 50k. Notes 08
Approach: Utilize Single Dimensional Index? • Index on attributes independently • Project query range to each attribute determine pointers. • Intersect pointers • Go to the database and retrieve objects in the intersection. May result in very high I/O cost Notes 08
Example 1 • Use one index, say Dept. • Get all Dept = “Toy” records and check their salary I1 Notes 08
Example 2 • Use 2 Indexes • Manipulate Pointers Toy Sal > 50K Notes 08
Partitioned hash function Idea: Key1 Key2 010110 1110010 h1 h2 Notes 08
<Fred> <Joe><Sally> Example h1(toy) =0 000 h1(sales) =1 001 h1(art) =1 010 . 011 . h2(10k) =01 100 h2(20k) =11 101 h2(30k) =01 110 h2(40k) =00 111 . . <Fred,toy,10k>,<Joe,sales,10k>, <Sally,art,30k> Insert Notes 08
h1(toy) =0 000 h1(sales) =1 001 h1(art) =1 010 . 011 . h2(10k) =01 100 h2(20k) =11 101 h2(30k) =01 110 h2(40k) =00 111 . . • Find Emp. with Dept. = Sales & Sal=40k <Fred> <Joe><Jan> <Mary> <Sally> <Tom><Bill> <Andy> Notes 08
look here h1(toy) =0 000 h1(sales) =1 001 h1(art) =1 010 . 011 . h2(10k) =01 100 h2(20k) =11 101 h2(30k) =01 110 h2(40k) =00 111 . . • Find Emp. with Sal=30k <Fred> <Joe><Jan> <Mary> <Sally> <Tom><Bill> <Andy> Notes 08
look here h1(toy) =0 000 h1(sales) =1 001 h1(art) =1 010 . 011 . h2(10k) =01 100 h2(20k) =11 101 h2(30k) =01 110 h2(40k) =00 111 . . • Find Emp. with Dept. = Sales <Fred> <Joe><Jan> <Mary> <Sally> <Tom><Bill> <Andy> Notes 08
Multi-key indexes I2 I3 I1 Notes 08
Art Sales Toy Example Record Dept Index Salary Index 10k 15k 17k 21k Name=Joe DEPT=Sales SAL=15k 12k 15k 15k 19k Notes 08
For which queries is this index good? Find RECs Dept = “Sales” & SAL=20k Find RECs Dept = “Sales” & SAL > 20k Find RECs Dept = “Sales” Find RECs SAL = 20k Notes 08
Multiple Key Index • Cannot support partial match queries on second attribute • performance of range search not much better compared to independent attribute approach • the secondary indices may be of different sizes -- specifically some of them may be very small Notes 08
Multidimensional Indexing • Space Partitioning strategies • Recursive disjoint partitioning of space • Grid files, kd-tree, kdB-tree, HB tree, PK-tree, quad-tree and its variants, … • Object Grouping strategies • Group spatially close objects into clusters • R-tree, R*-tree, R+tree, X-tree, GiST, SR tree, … • Hybrid Strategies • Hybrid tree • Data structures support • Point query, range query, NN-query, join query Notes 08
R-trees (range trees) • Extension of B-tree to multidimensional space. • Paginated, balanced, • Can support both point data and data with spatial extent (e.g., rectangles) • Group objects into possibly overlapping clusters (rectangles in our case) • Search of a range query proceeds along all paths that overlap with the query. R3 R4 R2 R1 R0 R0 R2 R1 R3 R4 Notes 08
R-Tree 1 2 3 4 5 6 11 7 9 12 8 10 Notes 08
R-Tree 1 2 3 E F 4 5 6 11 7 G 9 12 8 H 10 Notes 08
R-Tree 1 B 2 3 E F 4 5 6 11 C 7 G 9 12 8 H 10 Notes 08
R-Tree A 1 B 2 3 E F 4 5 6 11 C 7 G 9 12 8 H 10 Notes 08
R-Tree A: B C G H C: F E B: E F G 6 4 5 1 2 3 10 11 12 H 7 8 9 Notes 08
R-tree Insert Object E • Step I1 • Chooseleaf L to Insert E /* find position to insert*/ • Step I2 • If L has room install E • Else SplitNode(L) • Step I3: • Adjust Tree /* propagate changes*/ • Step I4: • if node split propagates to root, adjust tree height Notes 08
Function: ChooseLeaf • Step CL1: • Set N to be root • Step CL2: • If N is a leaf, return N • Step CL3: • If N is not a leaf, let F be an entry whose rectangle needs least enlargement to include the object • Step CL4 • Set N to be child node pointed by entry F • goto Step CL2 Notes 08
Split Node • Given a node split it into two nodes which are each atleast half full • Multiple Objectives: • minimize overlap • minimize covered area • R-tree minimizes covered area • What is an optimal criteria??? Minimize covered area Minimize overlap Notes 08
Minimizing Covered Area • Group objects into 2 parts such that the covered area is minimized • NP Hard!! • Hence use heuritics • Two heuristics explored • quadratic and linear Notes 08
Basic Split Strategy • /* Divide the set of M+1 entries into 2 groups G1 and G2 */ • PickSeeds for G1 and G2 • Invoke PickNext to assign an object to a group recursively until either all objects assigned or one of the groups becomes half full. • If one group gets half full assign rest of the objects to the other group. Notes 08
Quadratic Split • PickSeed: • for each pair of entries E1 and E2 compose a rectangle J including E1.rect and E2.rect • let d = area(J) - area(E1.rect) - area(E2.rect) /* d is wasted space */ • Choose the most wasteful pair with largest d as seeds for groups G1 and G2. • PickNext /*select next entry to put in a group */ • Determine cost of putting each entry in the group G1 and G2 • for each unassigned entry calculate • d1 = area increase required in the covering rectangle in Group G1 to include the entry • d2= area increase required in the covering rectangle in Group G2 to include the entry. • Select entry with greatest preference for a group • choose any entry with the maximum difference between d1 and d2 Notes 08
Linear Split • PickSeed • find extreme rectangles along each dimension • find entries with the highest low side and the lowest high side • record the separation • Normalize the separation by width of extent along the dimension • Choose as seeds the pair that has the greatest normalized distance along any dimension • PickNext • randomly choose entry to assign Notes 08
R-tree Search (Range Search on range S) • Start from root • If node T is not leaf • check entries E in T to determine if E.rectangle overlaps S • for all overlapping entries invoke search recursively • If T is leaf • check each entry to see if it entry satisfies range query Notes 08
K A F G X J B D E I H A B C M D E F G H I J K L M N L N C Example Search window X- tree is traversed down when overlap with X Notes 08
R-tree Delete • Step D1 • find the object and delete entry • Step D2 • Condense Tree • Step D3 • if root has 1 node shorten tree height Notes 08
Condense Tree • If node is too empty • delete entry from parent and add to a set Q • Adjust bounding rectangle of parent • Do the above recursively for all levels • Reinsert all the orphaned entries in Q • insert them at the same level they were deleted. Notes 08
Nearest Neighbor Search • Retrieve the nearest neighbor of query point Q • Simple Strategy: • convert the nearest neighbor search to range search. • Guess a range around Q that contains at least one object say O • if the current guess does not include any answers, increase range size until an object found. • Compute distance d’ between Q and O • re-execute the range query with the distance d’ around Q. • Compute distance of Q from each retrieved object. The object at minimum distance is the nearest neighbor!!! Why? • Issues: how to guess range, the retrieval may be sub-optimal if incorrect range guessed. Becomes a problem in high dimensional spaces. Notes 08
Nearest Neighbor Search using Range Searches Distance between Q and A b Initial range search Q A Revised range search A optimal strategy that results in minimum number of I/Os possible using priority queues. Notes 08
Optimal Strategy for KNN search • The optimal algorithm maintains a priority queue in memory which contains tree nodes and objects • The nodes in the queue sorted based on MINDIST • Nodes traversed in the order MINDIST • Algorithm stops when an object is at the top of the queue (least mindist). This object is the NN. • One can compute K-NN incrementally. • The algorithm is I/O optimal Notes 08
MINDIST Between Point and Rectangle • MINDIST (P, R) is the minimum distance between a point P and a rectangle R. • If the point is inside the rectangle, MINDIST = 0 • If the point is outside the rectangle, MINDIST is the minimal possible distance from the point to any object in or on the perimeter of the rectangle. Notes 08
MINDIST Between Rectangle and Point Q Q T Q S Notes 08
MINDIST Property • MINDIST is a lower bound of any k-NN distance Notes 08
Improving the KNN Algorithm • While the mindist based algorithm is I/O optimal, its performance may be further improved by pruning nodes from the priority queue. Notes 08
MBR face property • MBR is an n-dimensional Minimal Bounding Rectangle used in R trees, which is the minimal bounding n-dimensional rectangle bounds its corresponding objects. • MBR face property: Every face of any MBR contains at least one point of some object in the DB. Notes 08
MBR face property Notes 08
MBR face property in 3-d Notes 08
Search improvement • Visit an MBR only when necessary • How to do pruning? • MINDIST • MINMAXDIST Notes 08
MINDIST Property • MINDIST is a lower bound of any k-NN distance Notes 08