1 / 106

Multidimensional Access Methods

Multidimensional Access Methods. Volker Gaede IC-Parc, Imperial College, London Oliver Gunther Humboldt-Universitat, Berlin ACM Computing Surveys, June 1998 Mohammad Reza Kolahdouzan, kolahdoz@usc.edu. Outline. Organization of Spatial Data What is Special About Spatial?

fern
Télécharger la présentation

Multidimensional Access Methods

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Multidimensional Access Methods Volker Gaede IC-Parc, Imperial College, London Oliver Gunther Humboldt-Universitat, Berlin ACM Computing Surveys, June 1998 Mohammad Reza Kolahdouzan, kolahdoz@usc.edu

  2. Outline • Organization of Spatial Data • What is Special About Spatial? • Definitions and Queries • Basic Data Structures • One-Dimensional Access Method • Main Memory Structures • Point Access Methods • Multidimensional Hashing • Hierarchical Access Methods • Space Filling Curves for Point Data • Spatial Access Methods • Transformation • Overlapping Regions • Clipping • Multiple Layers • Comparative Studies, Conclusion

  3. What is special about Spatial Data? • Basic properties: • Complex structure • object could be point or thousands of polygons • variable tuple size • Dynamic • different operations (insert, delete, update) are interleaved • Large • e.g.: gigabytes for geographic maps • Integration of secondary and tertiary memory • Several proposals but no standard algebra • no standard set of operators • operators depends of domain (application specific) • Not closed operators • result of an operator can return any kind of object • Expensive computational costs

  4. What is special about Spatial Data? ... • Special physical layer support is required for “Search” operators (for search, update, … ) • Multidimensional access methods required. • Hard to do: no total ordering for spatial objects that preserves spatial proximity (i.e., no mapping from 2,3,... to dimensions to 1) • One solution: • single key structure, per dimension • problem: each index is traversed independently of others

  5. What is special about Spatial Data? ... • Requirements for multidimensional access methods: • Dynamic: keep track of changes (inserts, deletes, updates, ... ) • Secondary/tertiary storage management: not possible to have everything in main memory • Broad range of supported operations: not sacrifice one for others • Independence of the input data(distribution)and insertion sequence • Simplicity • Scalability • Time efficiency of search: goal is to meet one-dimensional B-tree: logarithmic worst case • Space efficiency • Concurrency and recovery: multiple concurrent accesses • minimum impact for integration

  6. Definitions • Multidimensional access method: support search in spatial DB • Point access methods (PAMs): designed to perform spatial search in point DBs • point could have 2,3,… dimensions, but no extension. • Spatial access methods (SAMs): can manage extended objects (lines, polygons, ...) • Universe, Original Space: a d-dimensional Euclidean space, E(d) • Any point has a unique location, defined by its d coordinates • d-dimensional polytope, P: intersection of some finite number of closed half-spaces in E(d), such that the dimension of the smallest affine subspace containing P is d. • D-dimensional polyhedron: • union of some polytopes. • Divides the points in space to: its interior, boundary, and exterior (mutually disjoint)

  7. Definitions ... • Examples for polyhedron: • line (polyline) = 1-dimensional polyhedron • region (polygon) = 2-dimensional polyhedron • A set of k-dimensional polyhedra forms a data type, (e.g., {point, line, region, … } • Spatial attribute of a spatial data describes its extension o.G • synonyms: • geometry, shape, spatial extension • shape descriptor, shape description, shape information, geometric description

  8. Definitions ... • Simple solutions for indexing: • Abstracting complex object with simpler shape, like bounding box or sphere, and providing a link to actual data • produces candidate solutions • based on application, refinement may be necessary • Decomposition, represent a shape with union of simpler shapes (e.g., convex polygons with bounded number of vertices).

  9. Definitions ... • What is efficiency? • Space: minimize number of occupied bytes • Time: not clear! • We care about elapsed time, but it depends on implementation, hardware, … • Solution: number of disk accesses for a search. • Based on assumption that search is I/O bound (B-tree) • May not be true for spatial objects • Should still minimize number of disk accesses

  10. Queries • No standard query language • Application domain dependent set of operators • Some more common operators (intersection, … ) • Queries are usually in term of extension to SQL

  11. Queries ... • Common Queries • Query 1: Exact Match Query EMQ, Object Query • returns objects that have similar extension

  12. Queries ... • Common Queries ... • Query 2: Point Query PQ • returns an object that contains a point

  13. Queries ... • Common Queries ... • Query 3: Window Query WQ, Range query • returns objects that have at least one point in an interval • intervals are iso-oriented (parallel to coordinate axes)

  14. Queries ... • Common Queries ... • Query 4: Intersection Query IQ, Region query, Overlap Query • returns objects that have at least one common point with another object

  15. Queries ... • Common Queries ... • Query 5: Enclosure Query EQ • returns objects that enclose another object

  16. Queries ... • Common Queries ... • Query 6: Containment Query CQ • returns objects that are contained in another object • contain and enclosure are dual

  17. Queries ... • Common Queries ... • Query 7: Adjacency Query AQ • returns adjacent objects to another one

  18. Queries ... • Common Queries ... • Query 8: Nearest Neighbor Query NNQ • returns an object that has the minimum distance to another one • distance between two objects: distance between their closest points

  19. Queries ... • Common Queries ... • Query 9: Spatial join • two collection R and S of spatial data • spatial predicate θ (e.g., intersects(), contains(), adjacent(), … ) • find pair of objects

  20. One Dimensional Access Methods • Important foundation of mutlidimensional methods • Access methods: • linear hashing • extendible hashing • Hierarchical (e.g., B-tree) • Hierarchical: • scalable • behave well for skewed data • nearly independent data distribution • Hashing: • performance depends on distribution, and hashing function

  21. One dimensional Access Methods ... • Linear hashing: • divide universe [A,B) of possible hash values into binary intervals • size of intervals: (B-A)/2^k and (B-A)/2^(k+1) • if a bucket is full, split an interval (not necessarily the same bucket) to two equal subintervals • no guarantee for relieving overflowed bucket • put overload of a bucket in overflow page, link it to original • Extendible hashing: • binary intervals (cells), like linear • use index in a directory for each cell, no overflow pages • if a bucket at maximal depth exceeds capacity, splits all cells into two • for skewed data, this may lead to different directory entries for same data region • pro: exact match takes at most two page accesses • con: poor space utilization, non-incremental growth (doubling) • solution: bounded-index extendible hashing

  22. One dimensional Access Methods ... • B-tree: • organize data hierarchically • balanced • relate to nesting of intervals • interior nodes: point to immediate mutually disjoint descendants • leaf nodes: data items • comparison: • B-tree: adaptive (size of interval depends on data) • hash methods outperform B-tree for uniformly distributed data for: • exact match queries, insertion, deletion

  23. Main Memory Structures • Early methods • only work in main memory, not optimized for secondary memory • are adapted in real methods • our running example: • points, polygons (their centroids, or MBBs)

  24. Main Memory Structures ... • k-d-tree • binary search tree • can only handle points (polygons with their centroids) • recursively divide the universe using (d-1)dimensional hyperplanes • hyperplanes: • iso-oriented • contain at least on data point • Interior nodes has one or two descendants, used to guide search • insert and search are straight forward • delete may require reorganizing a sub-tree • cons: • structure sensitive to ordering of insertion • data is all over the tree

  25. Main Memory Structures ... • k-d-tree

  26. Main Memory Structures ... • Adaptive k-d-tree • solves the problems of k-d-tree • tries to splits with about the same number of elements in each side • hyperplanes need not contain a data point • split till each subspace has some specific number of elements • result: • splitting points not part of the data • all data in leaves • interior nodes contain dimension and coordinates of split • cons: • difficult to keep the tree balanced (when frequent inserts and deletes) • tree still depends on order of insertion • pro: • very well for static environments • con for all k-d-trees: • for some certain distributions, no hyper-plane can split the data points evenly

  27. Main Memory Structures ... • Adaptive k-d-tree

  28. Main Memory Structures ... • BSP-tree (binary space partitioning) • like k-d-tree splits the world • splits independent of other subspaces • hyper-planes have arbitrary orientations (not iso-oriented) • decomposes until number of objects is less than a threshold • interior node: hyper-planes • leaf nodes: subspaces, links to contained objects • search for a point: straight forward • range search: generalization of point search • cons: • generally not balanced, may have very deep subtress • higher space requirements (info of the hyperplanes) • pros: • adapt well for different distributions

  29. Main Memory Structures ... • BSP-tree (binary space partitioning)

  30. Main Memory Structures ... • BD-tree • binary tree • represents subdivisions of data into interval shaped regions • encode regions in bit strings (DZ expressions, z values, ST_Morton_Number) • associate bit strings to nodes • finding z values: • left and down of hyperplanes : 0 • right and up of hyperplanes: 1 • splitting: • make interval-shaped excision: one node is excision (interval-shaped) part, the other one is remainder of the subspace • result: each data bucket contain at least 1/3 of the original entries • searching: • first finds full prefix • then traverses the tree based on prefix

  31. Main Memory Structures ... • BD-tree

  32. Main Memory Structures ... • The quadtree • closely related to k-d-tree • split with iso-oriented hyperplanes • difference: not binary tree any more • each node has 2^d descendants (in d dimensional space) • each node is interval shaped partitions • partitions don’t have to be equal size • decompose till a threshold • many variants • search is similar to k-d-tree • for point query: traverses one subtree • for region query: goes to often more than one subtree • cons: • not necessarily balanced • deeper subtrees for dense regions

  33. Main Memory Structures ... • point quadtree • constructed by inserting points one by one • how to insert: • search for a point • if not found, add it to the leaf node which search is stopped at • divide corresponding partition to 2^d subspaces, new point in center • deletion: requires reconstructing subtree

  34. Main Memory Structures ... • region quadtree • regular decomposition of the universe • result: 2^d subspaces are equal • very good for search operations

  35. Point Access Methods • Previous access methods designed for main memory • Can be used for secondary memory, but performance is very below optimum (no control over how OS accesses disks) • PAMs: organize points in to buckets, buckets in to subspaces • Different approaches for PAMs: • hashing (extended, linear) • hierarchical (tree based) • space-filling curves • lots of hybrid methods are also available

  36. Point Access Methods ... • Another taxonomy for PAMs: based on properties of buckets: • disjoint of overlapped • cover complete universe, or partial cover • interval (box), or arbitrary shapes

  37. Point Access Methods ... • Multidimensional Extended Hashing • Heuristics: hashing functions that preserves ordering to some extend • idea: objects close in space should be stored close in disk • Grid File: • superimposes d-dimensional orthogonal grid on the universe. • Grids not regular, different shapes and sizes for cells • grid directory links cells to data buckets • grid directory stored in disk, grid is in main memory • result: no more than 2 disk accesses for exact match queries • insert data: if bucket is full, requires cell splitting

  38. Point Access Methods ... • Grid File • con: super-linear growth even for uniformly distributed data (lower left part, 4 cells are pointed to one bucket) • exact match point query: • use scales to locate cell containing search point • if not in memory, one disk access to get it • retrieve bucket that have the objects in cell, and compare to search point (one more access) • range query: • test all cells that overlap the search region

  39. Point Access Methods ... • EXCELL • like grid file • splits the universe regularly • if insert require cell splitting, splits all cells (double the directory size)

  40. Point Access Methods ... • Two level Grid file • idea: use a second grid file to manage grid directory • first level is root directory (coarsened level of the second level) • second level, actual grid directory • first level points to second, second to data • pro: • splits are confined to subdirectory regions, not effecting their surroundings: slower directory growth. • Second level in main memory: still 2 accesses for exact match

  41. Point Access Methods ... • Two level Grid file

  42. Point Access Methods ... • Twin Grid File • idea: second grid file: increase space utilization • objective: reduce total number of buckets (or cells) in two files • relation between files not hierarchical (like two-level) • both files spans the whole universe • data is distributed between two files dynamically • pro: • space utilization: 90% (original grid 69%) • superior to original grid for larger search spaces • con: inferior to original grid for smaller query ranges • inserting data: • if bucket overflows, first try to redistribute the data between files • could lead to bucket overflow or underflow in one • make the shift permanent if the number of buckets is minimized • otherwise split the bucket • search requires checking both files

  43. Point Access Methods ... • Twin Grid File

  44. Point Access Methods ... • Multidimensional Linear Hashing • no directory (unlike extended), or very small directory • result: • relatively little storage requirements • can keep the related information in main memory • early techniques: • fail to support range queries • multidimensional order-preserving linear hashing with partial expansions (MOLHPE)! • idea : partially extending the buckets without expanding the file size at the same time • guarantees modest file growth • very good for uniformly distributed data, fails for nonuniform distributions

  45. Point Access Methods ... • Hierarchical Access Methods • based on binary of multiway tree structure • like hashing, stores data in bucket • each bucket is leaf of a node, and a disk page • interior nodes of the tree guide search • search: top-down tree traversal • difference between different methods: characteristics of the regions

  46. Point Access Methods ... • Hierarchical Access Methods: • k-d-B-tree • combination of adaptive k-d-tree and B-tree • partition the universe like adaptive k-d • associates subspaces to tree nodes • interior nodes are intervals • nodes in same level are mutually disjoint • perfectly balanced (like B-tree) • search straightforward, like k-d-tree • insert: search, find the right bucket, if required split and move half the data to it. • Deletion: search, remove, if necessary merge node with siblings • con: no minimum space utilization can be guaranteed

  47. Point Access Methods ... • Hierarchical Access Methods: • k-d-B-tree

  48. Point Access Methods ... • Hierarchical Access Methods: • LSD tree (??) • directory is organized same as adaptive k-d-tree • better adaptation to data distribution (in compare to fixed binary partitioning) • external balancing property: heights of external subtrees differ at most by one (??) • combines two split strategies to accommodate skewed data: • data-dependent : based on data, tries to achieve most balanced structure (equal number of data in both sides of split) • distribution-dependent: split at fixed dimension and position (know distribution is assumed)

  49. Point Access Methods ... • Hierarchical Access Methods: • LSD tree (??)

  50. Point Access Methods ... • Hierarchical Access Methods: • Buddy tree: • dynamic hashing scheme with tree structure (hybrid) • tree is made by consecutive insertions • cut the universe equally with iso-oriented hyperplanes • interior nodes: a partition and an interval (MBB of points or intervals below node) • intervals in same level nodes are mutually disjoint • leaves are data (like other trees!!) • each directory node has at least two entries • so: may not be balanced • when a node splits, MBB of two intervals are computed to reflect the current situation • so: tries to achieve high selectivity at directory level • except for root, only one pointer refers to each directory page. • so: guarantees linear growth

More Related