1 / 61

ICS 214A: Database Management Systems Winter 2004

ICS 214A: Database Management Systems Winter 2004. Lecture 08: Multi-dimensional Indexing. Motivation. Many applications of databases are geographical (2-d) data. Others involve large number of dimensions Examples: location of restaurants in a city.

kennedy-lee
Télécharger la présentation

ICS 214A: Database Management Systems Winter 2004

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. ICS 214A: Database Management Systems Winter 2004 Lecture 08: Multi-dimensional Indexing

  2. Motivation • Many applications of databases are geographical (2-d) data. Others involve large number of dimensions • Examples: • location of restaurants in a city. • Map data: zones, county lines, rivers, lakes, etc. (Data has spatial extent) • Sales information described by store, day, item, color, size, etc. Sale = point in multidimensional space. • Student described by age, zipcode, marital status. • Queries: • Point queries • Range Query: “find all McDonald restaurant within a given region”. • Nearest Neighbor Query: Find the nearest McDonald to my house • partial match queries • Spatial join (“all pairs” queries) Notes 08

  3. Types of Queries Point Query Range Query NN Query Spatial Join Query Notes 08

  4. Notes 08

  5. Notes 08

  6. Notes 08

  7. Other Applications with Multi-Dimensional Data • Mechanical CAD • VLSI • Bio Medical Imaging • OLAP • Multimedia data • … Notes 08

  8. Example Example: Find employee records where DEPT = “Toy” AND SAL > 50k. Notes 08

  9. Approach: Utilize Single Dimensional Index? • Index on attributes independently • Project query range to each attribute determine pointers. • Intersect pointers • Go to the database and retrieve objects in the intersection. May result in very high I/O cost Notes 08

  10. Example 1 • Use one index, say Dept. • Get all Dept = “Toy” records and check their salary I1 Notes 08

  11. Example 2 • Use 2 Indexes • Manipulate Pointers Toy Sal > 50K Notes 08

  12. Partitioned hash function Idea: Key1 Key2 010110 1110010 h1 h2 Notes 08

  13. <Fred> <Joe><Sally> Example h1(toy) =0 000 h1(sales) =1 001 h1(art) =1 010 . 011 . h2(10k) =01 100 h2(20k) =11 101 h2(30k) =01 110 h2(40k) =00 111 . . <Fred,toy,10k>,<Joe,sales,10k>, <Sally,art,30k> Insert Notes 08

  14. h1(toy) =0 000 h1(sales) =1 001 h1(art) =1 010 . 011 . h2(10k) =01 100 h2(20k) =11 101 h2(30k) =01 110 h2(40k) =00 111 . . • Find Emp. with Dept. = Sales & Sal=40k <Fred> <Joe><Jan> <Mary> <Sally> <Tom><Bill> <Andy> Notes 08

  15. look here h1(toy) =0 000 h1(sales) =1 001 h1(art) =1 010 . 011 . h2(10k) =01 100 h2(20k) =11 101 h2(30k) =01 110 h2(40k) =00 111 . . • Find Emp. with Sal=30k <Fred> <Joe><Jan> <Mary> <Sally> <Tom><Bill> <Andy> Notes 08

  16. look here h1(toy) =0 000 h1(sales) =1 001 h1(art) =1 010 . 011 . h2(10k) =01 100 h2(20k) =11 101 h2(30k) =01 110 h2(40k) =00 111 . . • Find Emp. with Dept. = Sales <Fred> <Joe><Jan> <Mary> <Sally> <Tom><Bill> <Andy> Notes 08

  17. Multi-key indexes I2 I3 I1 Notes 08

  18. Art Sales Toy Example Record Dept Index Salary Index 10k 15k 17k 21k Name=Joe DEPT=Sales SAL=15k 12k 15k 15k 19k Notes 08

  19. For which queries is this index good? Find RECs Dept = “Sales” & SAL=20k Find RECs Dept = “Sales” & SAL > 20k Find RECs Dept = “Sales” Find RECs SAL = 20k Notes 08

  20. Multiple Key Index • Cannot support partial match queries on second attribute • performance of range search not much better compared to independent attribute approach • the secondary indices may be of different sizes -- specifically some of them may be very small Notes 08

  21. Multidimensional Indexing • Space Partitioning strategies • Recursive disjoint partitioning of space • Grid files, kd-tree, kdB-tree, HB tree, PK-tree, quad-tree and its variants, … • Object Grouping strategies • Group spatially close objects into clusters • R-tree, R*-tree, R+tree, X-tree, GiST, SR tree, … • Hybrid Strategies • Hybrid tree • Data structures support • Point query, range query, NN-query, join query Notes 08

  22. R-trees (range trees) • Extension of B-tree to multidimensional space. • Paginated, balanced, • Can support both point data and data with spatial extent (e.g., rectangles) • Group objects into possibly overlapping clusters (rectangles in our case) • Search of a range query proceeds along all paths that overlap with the query. R3 R4 R2 R1 R0 R0 R2 R1 R3 R4 Notes 08

  23. R-Tree 1 2 3 4 5 6 11 7 9 12 8 10 Notes 08

  24. R-Tree 1 2 3 E F 4 5 6 11 7 G 9 12 8 H 10 Notes 08

  25. R-Tree 1 B 2 3 E F 4 5 6 11 C 7 G 9 12 8 H 10 Notes 08

  26. R-Tree A 1 B 2 3 E F 4 5 6 11 C 7 G 9 12 8 H 10 Notes 08

  27. R-Tree A: B C G H C: F E B: E F G 6 4 5 1 2 3 10 11 12 H 7 8 9 Notes 08

  28. R-tree Insert Object E • Step I1 • Chooseleaf L to Insert E /* find position to insert*/ • Step I2 • If L has room install E • Else SplitNode(L) • Step I3: • Adjust Tree /* propagate changes*/ • Step I4: • if node split propagates to root, adjust tree height Notes 08

  29. Function: ChooseLeaf • Step CL1: • Set N to be root • Step CL2: • If N is a leaf, return N • Step CL3: • If N is not a leaf, let F be an entry whose rectangle needs least enlargement to include the object • Step CL4 • Set N to be child node pointed by entry F • goto Step CL2 Notes 08

  30. Split Node • Given a node split it into two nodes which are each atleast half full • Multiple Objectives: • minimize overlap • minimize covered area • R-tree minimizes covered area • What is an optimal criteria??? Minimize covered area Minimize overlap Notes 08

  31. Minimizing Covered Area • Group objects into 2 parts such that the covered area is minimized • NP Hard!! • Hence use heuritics • Two heuristics explored • quadratic and linear Notes 08

  32. Basic Split Strategy • /* Divide the set of M+1 entries into 2 groups G1 and G2 */ • PickSeeds for G1 and G2 • Invoke PickNext to assign an object to a group recursively until either all objects assigned or one of the groups becomes half full. • If one group gets half full assign rest of the objects to the other group. Notes 08

  33. Quadratic Split • PickSeed: • for each pair of entries E1 and E2 compose a rectangle J including E1.rect and E2.rect • let d = area(J) - area(E1.rect) - area(E2.rect) /* d is wasted space */ • Choose the most wasteful pair with largest d as seeds for groups G1 and G2. • PickNext /*select next entry to put in a group */ • Determine cost of putting each entry in the group G1 and G2 • for each unassigned entry calculate • d1 = area increase required in the covering rectangle in Group G1 to include the entry • d2= area increase required in the covering rectangle in Group G2 to include the entry. • Select entry with greatest preference for a group • choose any entry with the maximum difference between d1 and d2 Notes 08

  34. Linear Split • PickSeed • find extreme rectangles along each dimension • find entries with the highest low side and the lowest high side • record the separation • Normalize the separation by width of extent along the dimension • Choose as seeds the pair that has the greatest normalized distance along any dimension • PickNext • randomly choose entry to assign Notes 08

  35. R-tree Search (Range Search on range S) • Start from root • If node T is not leaf • check entries E in T to determine if E.rectangle overlaps S • for all overlapping entries invoke search recursively • If T is leaf • check each entry to see if it entry satisfies range query Notes 08

  36. K A F G X J B D E I H A B C M D E F G H I J K L M N L N C Example Search window X- tree is traversed down when overlap with X Notes 08

  37. R-tree Delete • Step D1 • find the object and delete entry • Step D2 • Condense Tree • Step D3 • if root has 1 node shorten tree height Notes 08

  38. Condense Tree • If node is too empty • delete entry from parent and add to a set Q • Adjust bounding rectangle of parent • Do the above recursively for all levels • Reinsert all the orphaned entries in Q • insert them at the same level they were deleted. Notes 08

  39. Nearest Neighbor Search • Retrieve the nearest neighbor of query point Q • Simple Strategy: • convert the nearest neighbor search to range search. • Guess a range around Q that contains at least one object say O • if the current guess does not include any answers, increase range size until an object found. • Compute distance d’ between Q and O • re-execute the range query with the distance d’ around Q. • Compute distance of Q from each retrieved object. The object at minimum distance is the nearest neighbor!!! Why? • Issues: how to guess range, the retrieval may be sub-optimal if incorrect range guessed. Becomes a problem in high dimensional spaces. Notes 08

  40. Nearest Neighbor Search using Range Searches Distance between Q and A b Initial range search Q A Revised range search A optimal strategy that results in minimum number of I/Os possible using priority queues. Notes 08

  41. Optimal Strategy for KNN search • The optimal algorithm maintains a priority queue in memory which contains tree nodes and objects • The nodes in the queue sorted based on MINDIST • Nodes traversed in the order MINDIST • Algorithm stops when an object is at the top of the queue (least mindist). This object is the NN. • One can compute K-NN incrementally. • The algorithm is I/O optimal Notes 08

  42. MINDIST Between Point and Rectangle • MINDIST (P, R) is the minimum distance between a point P and a rectangle R. • If the point is inside the rectangle, MINDIST = 0 • If the point is outside the rectangle, MINDIST is the minimal possible distance from the point to any object in or on the perimeter of the rectangle. Notes 08

  43. MINDIST Between Rectangle and Point Q Q T Q S Notes 08

  44. MINDIST Property • MINDIST is a lower bound of any k-NN distance Notes 08

  45. Improving the KNN Algorithm • While the mindist based algorithm is I/O optimal, its performance may be further improved by pruning nodes from the priority queue. Notes 08

  46. MBR face property • MBR is an n-dimensional Minimal Bounding Rectangle used in R trees, which is the minimal bounding n-dimensional rectangle bounds its corresponding objects. • MBR face property: Every face of any MBR contains at least one point of some object in the DB. Notes 08

  47. MBR face property Notes 08

  48. MBR face property in 3-d Notes 08

  49. Search improvement • Visit an MBR only when necessary • How to do pruning? • MINDIST • MINMAXDIST Notes 08

  50. MINDIST Property • MINDIST is a lower bound of any k-NN distance Notes 08

More Related