# R-TREES

Télécharger la présentation

## R-TREES

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
##### Presentation Transcript

1. R-TREES CMPE 422:Database systems Ahmet Menteş

2. TOPICS COVERED • Introduction • Spatial Data and Queries • R-Tree Index • Algorithmsand Complexities • Performance • R-Tree Variations • Conclusion

3. INTRODUCTION • First Proposed By Antonin Guttman, 1984 from UC Berkeley -“A Dynamic Index Structure For Spatial Searching” • R-Tree structures are used for effectively storing and indexing spatial data. Usage areas are wide • -Computer Aided Design-CAD • -Video Recognition-VR • -Image Processing

4. INTRODUCTION • R-Trees have correspondance with B-Trees but used for keeping spatial data. -Based on B-trees • B-Trees cannot store new types of data • Wanted to store geometrical data and multi-dimensional data • Region-trees (R-trees)

5. SPATIAL DATA • Any Type of Geometry • Point • City • Line • Trail • Polygon • Border • A Collection of Geometries • Ski Resort Trails • Any Coordinate System • Meters • Pixels • WGS84 (GPS)

6. SPATIAL QUERY TYPES • Standard Insert, Search and Delete Queries • Spatial Range Queries • Find all cities within 30 kmof İstanbul • Nearest Neighbor Queries • Find the closest restourants to my address • Spatial Join Queries • Find all neighborhoods that are within 5 km of a university • Geometry Set Operations • Equal(), Disjoint(), Intersect(), Touch(), Cross(), Within(), Contains(), Overlap(), Distance(), Intersection(), Union(), Difference() and more...

7. R-TREE • R-tree is a depth-balanced tree –Each node corresponds to a disk page –Leaf node: an array of leaf entries A leaf entry: (mbb,oid) –Non-leaf node: an array of node entries A node entry: (dr, nodeid)

8. R-TREE

9. R-TREE INDEX STRUCTURE • Data objects in the map are represented by the Minimum Bounding Rectangles (MBRs)

10. R-TREE INDEX STRUCTURE Formal Description • Structure consisting of (regular) nodes containing tuples <I, child-pointer> • At the lowest level: leaf-nodes with tuples <I, tuple-identifier> • I = (I0, I1, … In) corresponds to MBR • n = Number of Dimensions in the Geometry • Each I is a set of the form [a,b] describing the range of the rectangle along the dimension • a or b can be equal to infinity • Tuple-identifier points to a record

11. R-TREE INDEX STRUCTURE • Important Parameters for R-Tree index • M is the maximum number of entries in one node • Parameter m ≤ M/2 specifies the minimum number of entries in a node • Every Leaf Node Contains Between m and M index records unless it is root. • For each index record, <I, tuple-identifier> in a leaf node is the smallest rectangle that spatially contains the n-dimensional data object. • Every non-leaf node has between m and M children unless it is the root. • For each entry <I, child-pointer> in a non-leaf node, I is the smallest rectangle that spatially contains the rectangles in the child nodes. • The root node has at least two children unless it is a leaf. • All leaves appear on the same level. • Height of an R-Tree  logmN-1

12. ALGORITHMS OVERVIEW • For all queries, it is possible to check if a point is within a rectangle in linear time (O(n)). • Query Types To Be Reviewed • Search • Splitting • Insertion • Deletion • Nearest Neighbor

13. SEARCH • Similar to B-tree search • Quite easy & straight forward(Traverse the whole tree starting at theroot node) • No guarantee on good worst-case performance!(Possible overlapping of rectangles of entries within a single node!)

14. SEARCH • For all entries in a non-leaf node: Check if overlapping If yes: check node pointed to by this entry. • If node is a leaf-node: Check all entries if overlapping the search object If yes: entry is a qualifying record!

15. SEARCH

16. SEARCH COMPLEXITY • Worst Case • Linear Search • Every MBR overlaps the search area • Best Case • No more than one overlap at each level • Complexity becomes O(logMn) • Again, dependent on geometries

17. SPLITTING • Splitting of nodes necessary afterunderflow oroverflow (as a result of a delete or insert operation) • Ultimate goal: Minimize the resulting node’s MBRs • Secondary aim: Speed of algorithm • 3 Implementations to split an R-Tree: Exhaustive, Quadratic-cost, Linear-cost

18. SPLITTING • Minimal resulting MBRs: • Exhaustive Approach: • Simply try all possible split-ups but this kind of approach is very costly in terms of complexity issue.

19. SPLITTING • Quadratic-cost Algorithm: • Pick the 2 out of M entries that would consume the most space if put together. Put one in each group • For all remaining entries: Pick the one that would makethe biggest difference in area when put to one of the two groups. Add it to the one with the least difference • Finished when all entries are put in either group! • Quadratic in M, linear in the number of dimensions.

20. SPLITTING • Linear-cost Implementation: • Basically the same as quadratic-cost algorithm • Differences: First pair is picked by finding the two rectangles with the greatest normalizedseparation in anydimension. Remaining pairs are selected randomly. • Linear in M as well as in the number ofdimensions

21. INSERTION • Similar to corresponding B-tree operation • Basically consist of 3 parts: 1. CHOOSELEAF (Find place to insert new object) 2. INSERT (Insert the new object) 3. ADJUSTTREE (Adjusting preceding nodes)

22. INSERTION Description: • CHOOSELEAF: • Start with root and run through all nodes then; • Find the one which would have to be leastenlarged to include given object! • INSERT: • Check if room for another entry Insert new entry directly OR after callingSPLITNODE (in case of no room)

23. INSERTION • ADJUSTTREE: • Ascend from node with new entry: • Adjust all MBRs! • In case of node split: • Add new entry to parent node(If no room in parent node, invokeSPLITNODE again) • Propagate upwards till root is reached!

24. INSERTION

25. INSERTION

26. INSERTION

27. INSERTION COMPLEXITY • IMPORTANT: Make sure nodes are split so they cover the smallest possible area. • Minimize search time • Important variables • N = Number of entries in each node • T = Tree height • Worst Case • 2 * N * T • O(n) GOOD SPLIT! BAD!

28. DELETION • NOT similar to B-tree DELETE (Treatment of underflows) • B-tree: Merge under-full node with “neighbor” • R-tree: Delete under-full node and re-insert • Why? Re-use of INSERT routine Incrementally refines spatial structure

29. DELETION Description • FINDLEAF: • Locate the leaf that contains object to be deleted • DELETE: • Delete entry from node • CONDENSETREE: • If node with the removed entry has too few entries, reallocate them • Recursively check the parent nodes until reaching the root • Update all MBR and remove all nodes that underfull • Reinsert all entries removed from the removed nodes according to the INSERT algorithm.

30. DELETION

31. DELETION COMPLEXITY • Deletion variables • N = number of entries in each node • T = tree height • Complexity • 2 * N * T

32. NEAREST NEIGHBOR • Two Options • Branch-and-bound search • Best first search • Branch-and-bound • Find two distances to each object • Minimum distance from the search point to any side of the other object’s MBR • Minmax distance • Least of the furthest distance in every dimension • Lowest upper bound on the distance from the point to an object • Best First Search • Calculates minmax distance for all objects • R-Tree sorted by minmax distance • Removes nodes from sorted tree • If node has no children it is the nearest neighbor.

33. NEAREST NEIGHBOR • Branch-and-Bound • Takes longer because it searches all nodes that have not been pruned • Best First Search • Investigates only the closest nodes • Large priority queue data structure in memory • Can cause thrashing • Run-time complexity subject to geometries • How many overlap and how large

34. PERFORMANCE Why do Benchmarking? • Proof practicality of the structure • Find suitable values for m and M • Test various node-splitting algorithms Testing environment: • C-implementation running under UNIX • Layout of a RISC-II chip with 1057 rectangles • Different page sizes (128 – 2048 bytes) • Various values for M (6 – 102)

35. PERFORMANCE • Linear INSERT fastest • Linear INSERT quite insensitive to M and m • Quadratic INSERT depends on M as well as m • DELETE extremely sensitive to m

36. PERFORMANCE • Same page hit / miss performance for linear and quadratic split • Slight advantage for exhaustive version • Space usage strongly depending on m

37. PERFORMANCE • Quadratic-cost splitting with jumps at INSERT operations for increasing amount of data • Linear INSERT extremely constant • R-tree structure very effective in directing search to small subtrees

38. R-TREE VARIATIONS R+-Tree: • R+ trees differ from R trees in that: • Nodes are not guaranteed to be at least half filled • The entries of any internal node do not overlap • An object ID may be stored in more than one leaf node. • Advantages • Because nodes are not overlapped with each other, point query performance benefits since all spatial regions are covered by at most one node. • A single path is followed and fewer nodesare visited than with the R-tree. • Disadvantages • Since rectangles are duplicated, an R+ tree can be larger than an R tree built on same data set. • Construction and maintenance of R+ trees is more complex than the construction and maintenance of R trees and other variants of the R tree. R-Tree MBRs R+-Tree MBRs

39. R-TREE VARIATIONS • R*-Tree • Do not split nodes on insert • Take entries from the overfull node and reinsert them into the tree • Changes MBRs • Saves time and possibly rebalances the tree

40. CONCLUSION Reasons for R-trees: • Importance of being able to store and effectively search spatial data. • Strong limitations with most spatialstructures (e.g.point data only, not dynamic, performance with paged memory, …) Basic Idea: • Structure similar to B-tree. • Represent objects by their MBR. • Allow overlapping.

41. CONCLUSION Algorithms: • Search and insert similar to B-tree operations • Delete performing re-insert instead of merge for under-full nodes (incrementally refines spatial structure) • Node splitting: Benchmarks show that linear version performs quiteas well as quadratic-cost implementation. Modifications: • Structure is easy to adapt for special applications and their needs. • Performance advantages are usually paid for with price of astructure that gets harder to maintain.