Multidimensional Indexing: Spatial Data Management & High Dimensional Indexing

Types of Spatial Data • Point Data • Points in a multidimensional space • E.g., Raster data such as satellite imagery, where each pixel stores a measured value • E.g., Feature vectors extracted from text • Region Data • Objects have spatial extent with location and boundary • DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Spatial Indexing • Point Access Methods (PAMs) vs Spatial Access Methods (SAMs) • PAM: index only point data • Hierarchical (tree-based) structures • Multidimensional Hashing • Space filling curve • SAM: index both points and regions • Transformations • Overlapping regions • Clipping methods (non-overlapping) • Data partitioning vs Space partitioning

Types of Spatial Queries • Spatial Range Queries • Find all cities within 50 miles of Troy • Query has associated region (location, boundary) • Answer includes overlapping or contained data regions • Nearest-Neighbor Queries • Find the 10 cities nearest to Troy • Results must be ordered by proximity • Spatial Join Queries • Find all cities near a lake • Expensive, join condition involves regions and proximity

Applications of Spatial Data • Geographic Information Systems (GIS) • E.g., ESRI’s ArcInfo; OpenGIS Consortium • Geospatial information • All classes of spatial queries and data are common • Computer-Aided Design/Manufacturing • Store spatial objects such as surface of airplane fuselage • Range queries and spatial join queries are common • Multimedia Databases • Images, video, text, etc. stored and retrieved by content • First converted to feature vector form; high dimensionality • Nearest-neighbor queries are the most common

High Dimensional Indexing • Requirements • Fast range/window query search (range query) • Fast similarity search • Similarity range query • K-nearest neighbour query (KNN query)

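Feature-Based Similarity Search • Complex objects → (feature extraction and transformation) → feature vectors → (index construction) → index for range/similarity search • Similarity queries are answered against the index [Figure: pipeline diagram; one panel is labelled "Nasdaq".]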

Retrieval by Colour Similarity • Given a sample image, search for images with a similar colour composition.

Query Requirement • Window/Range query: Retrieve the data points that fall within a given range along each dimension. • The index is designed to support range retrieval and to facilitate joins and similarity search (if applicable).
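A window query can be written as a plain filter before any index is involved. The sketch below is a linear scan under stated assumptions: the function name `window_query`, the inclusive bounds, and the sample (age, sal) tuples are illustrative and not taken from the slides. An index exists to avoid touching every point the way this scan does.

```python
def window_query(points, low, high):
    """Return the points p with low[i] <= p[i] <= high[i] in every dimension i."""
    return [
        p for p in points
        if all(lo <= x <= hi for x, lo, hi in zip(p, low, high))
    ]


# Hypothetical (age, sal) records
employees = [(52, 85_000), (51, 95_000), (40, 82_000), (54, 88_000)]
hits = window_query(employees, low=(50, 80_000), high=(55, 90_000))
```

The same function works for any number of dimensions because the range is tested one dimension at a time.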

Query Requirement • Similarity queries: • Similarity range and KNN queries • Similarity range query: Given a query point, find all data points within a given distance r of the query point. • KNN query: Given a query point, find the K data points nearest to it by distance. [Figure: a query point with radius r reaching out to the Kth nearest neighbour.]
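Both similarity queries can be stated as brute-force scans, which is useful as a reference when checking an index. This is a minimal sketch assuming Euclidean distance; the function names and sample points are illustrative, not part of the slides.

```python
import heapq
import math


def similarity_range_query(points, q, r):
    """Return all points whose Euclidean distance to q is at most r."""
    return [p for p in points if math.dist(p, q) <= r]


def knn_query(points, q, k):
    """Return the k points nearest to q, ordered by increasing distance."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, q))


points = [(1.0, 1.0), (2.0, 2.0), (5.0, 5.0), (9.0, 1.0)]
query = (0.0, 0.0)
in_range = similarity_range_query(points, query, r=3.0)
nearest = knn_query(points, query, k=2)
```

A KNN query is equivalent to a similarity range query whose radius r is the distance to the Kth nearest neighbour, which is not known in advance.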

Single-Dimensional Indexes • B+ trees are fundamentally single-dimensional indexes. • When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space since we sort entries first by age and then by sal. • Consider entries: <11, 80>, <12, 10>, <12, 20>, <13, 75> [Figure: the entries plotted with age (11 to 13) on the horizontal axis and sal (10 to 80) on the vertical axis, connected in B+ tree order.]

Multidimensional Indexes • A multidimensional index clusters entries so as to exploit “nearness” in multidimensional space. • Keeping track of entries and maintaining a balanced index structure presents a challenge! • Consider entries: <11, 80>, <12, 10>, <12, 20>, <13, 75> [Figure: the same entries plotted by age and sal, contrasting spatial clusters with B+ tree order.]
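The contrast between B+ tree order and spatial nearness can be checked on the four entries from the slide. In this sketch, lexicographic sorting stands in for the composite-key B+ tree order, and Euclidean distance is an assumed notion of nearness; the helper `nearest` is illustrative.

```python
import math

entries = [(11, 80), (12, 10), (12, 20), (13, 75)]

# Composite-key order: sort by age first, then by sal.
btree_order = sorted(entries)


def nearest(p, pts):
    """Nearest other entry to p in the 2-D (age, sal) space."""
    return min((o for o in pts if o != p), key=lambda o: math.dist(p, o))


nearest_in_space = {p: nearest(p, entries) for p in entries}
```

Here <11, 80> and <13, 75> are each other's nearest neighbours in space, yet they sit at opposite ends of the B+ tree order, with the two low-salary entries between them.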

Motivation for Multidimensional Indexes • Spatial queries (GIS, CAD). • Find all hotels within a radius of 5 miles from the conference venue. • Find the city with population 500,000 or more that is nearest to Kalamazoo, MI. • Find all cities that lie on the Nile in Egypt. • Find all parts that touch the fuselage (in a plane design). • Similarity queries (content-based retrieval). • Given a face, find the five most similar faces. • Multidimensional range queries. • 50 < age < 55 AND 80K < sal < 90K

What’s the difficulty? • An index based on spatial location is needed. • One-dimensional indexes don’t support multidimensional searching efficiently. • Hash indexes only support point queries; we want to support range queries as well. • Must support inserts and deletes gracefully. • Ideally, want to support non-point data as well (e.g., lines, shapes).

Multi-dimensional Indexes • Multi-key Indexes • Grid Files • Partitioned Hash Indexes • kd-Trees • Quad Trees • R Trees • Bitmap indexes

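Partitioned hash function • Idea: hash each key component separately and concatenate the results, so the bucket address is h1(Key1) followed by h2(Key2). [Figure: Key1 and Key2 feed h1 and h2, producing bit strings such as 010110 and 1110010 that are joined into one address.]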

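Example: • h1(toy) = 0, h1(sales) = 1, h1(art) = 1, ... • h2(10k) = 01, h2(20k) = 11, h2(30k) = 01, h2(40k) = 00, ... • Buckets are addressed 000 through 111 (the h1 bit followed by the two h2 bits). • Insert <Fred,toy,10k>, <Joe,sales,10k>, <Sally,art,30k>: Fred goes to bucket 001; Joe and Sally both go to bucket 101.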

Find Emp. with Dept. = Sales ∧ Sal = 40k • h1(toy) = 0, h1(sales) = 1, h1(art) = 1, ... • h2(10k) = 01, h2(20k) = 11, h2(30k) = 01, h2(40k) = 00, ... • Both attributes are given: h1(sales) = 1 and h2(40k) = 00, so only bucket 100 is examined. [Figure: buckets 000 to 111 holding <Fred>, <Joe>, <Jan>, <Mary>, <Sally>, <Tom>, <Bill>, <Andy>.]

Find Emp. with Sal = 30k • h1(toy) = 0, h1(sales) = 1, h1(art) = 1, ... • h2(10k) = 01, h2(20k) = 11, h2(30k) = 01, h2(40k) = 00, ... • Only h2(30k) = 01 is known, so both buckets ending in 01 (001 and 101) are examined. [Figure: the same eight buckets.]

Find Emp. with Dept. = Sales • h1(toy) = 0, h1(sales) = 1, h1(art) = 1, ... • h2(10k) = 01, h2(20k) = 11, h2(30k) = 01, h2(40k) = 00, ... • Only h1(sales) = 1 is known, so all four buckets starting with 1 (100, 101, 110, 111) are examined. [Figure: the same eight buckets.]
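The insert and the three lookups can be traced in a few lines. This sketch hard-codes the h1 and h2 values listed in the example; the dictionaries `H1` and `H2` and the function names are illustrative stand-ins for real hash functions.

```python
from itertools import product

# Hash values copied from the example: h1 yields 1 bit, h2 yields 2 bits.
H1 = {"toy": "0", "sales": "1", "art": "1"}
H2 = {"10k": "01", "20k": "11", "30k": "01", "40k": "00"}


def bucket_of(dept, sal):
    """Bucket address = h1(dept) followed by h2(sal)."""
    return H1[dept] + H2[sal]


def candidate_buckets(dept=None, sal=None):
    """Buckets that must be examined when only some attributes are given."""
    first = [H1[dept]] if dept is not None else ["0", "1"]
    second = [H2[sal]] if sal is not None else ["00", "01", "10", "11"]
    return sorted(a + b for a, b in product(first, second))


table = {}
for name, dept, sal in [("Fred", "toy", "10k"),
                        ("Joe", "sales", "10k"),
                        ("Sally", "art", "30k")]:
    table.setdefault(bucket_of(dept, sal), []).append(name)
```

A fully specified query touches one bucket, while each unspecified attribute multiplies the number of candidate buckets by the number of values its hash bits can take.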

Grid File • Hashing method for multidimensional points (an extension of extendible hashing) • Idea: Use a grid to partition the space; each cell is associated with one page • Two-disk-access principle (exact match)

Grid File • Start with one bucket for the whole space. • Select dividers along each dimension. Partition space into cells • Dividers cut all the way. • Each cell corresponds to 1 disk page. • Many cells can point to the same page. • Cell directory potentially exponential in the number of dimensions

Grid File Implementation • Dynamic structure using a grid directory • Grid array: a 2 dimensional array with pointers to buckets (this array can be large, disk resident) G(0,…, nx-1, 0, …, ny-1) • Linear scales: Two 1 dimensional arrays that are used to access the grid array (main memory) X(0, …, nx-1), Y(0, …, ny-1)

Example [Figure: a grid directory addressed through linear scale X and linear scale Y, with cells pointing to buckets (disk blocks).]

Grid File Search • Exact Match Search: at most 2 I/Os assuming linear scales fit in memory. • First use the linear scales to determine the index into the cell directory. • Access the cell directory to retrieve the bucket address (may cause 1 I/O if the cell directory does not fit in memory). • Access the appropriate bucket (1 I/O). • Range Queries: • Use the linear scales to determine the indexes into the cell directory. • Access the cell directory to retrieve the addresses of the buckets to visit. • Access the buckets.
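The two search procedures can be sketched with in-memory lists standing in for the linear scales, the cell directory, and the buckets. The divider values, bucket names, and points below are hypothetical, and the comments mark where a disk-resident implementation would incur I/O.

```python
from bisect import bisect_right

# Hypothetical 2-D grid file.
x_scale = [10, 20]   # dividers on X: cells for x < 10, 10 <= x < 20, x >= 20
y_scale = [50]       # dividers on Y: cells for y < 50, y >= 50
directory = [        # directory[i][j] -> bucket id; cells may share a bucket
    ["A", "B"],
    ["A", "C"],
    ["D", "C"],
]
buckets = {
    "A": [(5, 20), (12, 30)],
    "B": [(3, 70)],
    "C": [(15, 60), (25, 90)],
    "D": [(22, 10)],
}


def cell_of(x, y):
    """Use the linear scales (in memory) to find the directory index."""
    return bisect_right(x_scale, x), bisect_right(y_scale, y)


def exact_match(x, y):
    i, j = cell_of(x, y)
    bucket_id = directory[i][j]          # directory lookup (possibly 1 I/O)
    return (x, y) in buckets[bucket_id]  # bucket read (1 I/O)


def range_query(x_lo, x_hi, y_lo, y_hi):
    i_lo, j_lo = cell_of(x_lo, y_lo)
    i_hi, j_hi = cell_of(x_hi, y_hi)
    # Collect each bucket once, even if several cells point to it.
    ids = {directory[i][j]
           for i in range(i_lo, i_hi + 1)
           for j in range(j_lo, j_hi + 1)}
    return sorted(p for b in ids for p in buckets[b]
                  if x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi)
```

For example, `range_query(0, 15, 0, 100)` covers four directory cells but reads only three buckets, because two of those cells share bucket A.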

Grid File Insertions • Determine the bucket into which the insertion must occur. • If there is space in the bucket, insert. • Else, split the bucket • How to choose a good dimension to split? • If the bucket split forces the cell directory to split, split it and adjust the linear scales. • Inserting the new directory entries potentially requires a complete reorganization of the cell directory, which is expensive!

Grid File Deletions • Deletions may decrease the space utilization. Merge buckets • We need to decide which cells to merge and a merging threshold • Buddy system and neighbor system • A bucket can merge with only one buddy in each dimension • Merge adjacent regions if the result is a rectangle
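The rule "merge adjacent regions if the result is a rectangle" can be expressed as a small check. This is a sketch only: the region representation (x_lo, y_lo, x_hi, y_hi) and the function name are assumptions, and a real grid file would also apply the merging threshold before merging.

```python
def merged_if_rectangle(a, b):
    """Regions are (x_lo, y_lo, x_hi, y_hi).

    Return the union of a and b if it is itself a rectangle, else None.
    """
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Side by side: identical Y extent and a shared vertical edge.
    if (ay0, ay1) == (by0, by1) and (ax1 == bx0 or bx1 == ax0):
        return (min(ax0, bx0), ay0, max(ax1, bx1), ay1)
    # Stacked: identical X extent and a shared horizontal edge.
    if (ax0, ax1) == (bx0, bx1) and (ay1 == by0 or by1 == ay0):
        return (ax0, min(ay0, by0), ax1, max(ay1, by1))
    return None
```

Two regions that touch but have different extents along the shared edge return None, since their union would be L-shaped.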

Grid File Example (N=6) [Figure: points 1 to 6 in the data space, all stored in a single bucket A.]

Grid File Example (N=6) [Figure: points 1 to 12; the space has been split and the points are divided between buckets A and B.]

Grid File Example (N=6) [Figure: points 1 to 15; after a further split the points are divided among buckets A, B and C.]

Grid File Example (N=6) [Figure: points 1 to 16; after a further split the points are divided among buckets A, B, C and D.]

Grid File Example (N=6) [Figure: the final grid with linear scales x1 to x4 and y1 to y4; directory cells point to buckets A through I, and several cells share a bucket.]

Kd-Trees • Binary partitioning of space: each internal node splits on a < V versus a >= V for some attribute a and value V • The dimension to “cut” or “split” alternates among all dimensions • A split doesn’t have to span the whole dimension (unlike Grid Files) • Leaves are blocks that hold the points
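The properties above translate into a short build and range-search routine. This is a sketch under stated assumptions: the split value is the median coordinate, leaf blocks hold at most `LEAF_CAPACITY` points, and all class and function names are illustrative. If every point shares the same coordinate on the chosen dimension, the sketch simply keeps them in one oversized leaf.

```python
from dataclasses import dataclass

LEAF_CAPACITY = 2  # assumed number of points per leaf block


@dataclass
class Leaf:
    points: list


@dataclass
class Internal:
    dim: int        # attribute index used for the split
    value: float    # split value V
    left: object    # subtree with a < V
    right: object   # subtree with a >= V


def build(points, depth=0, k=2):
    if len(points) <= LEAF_CAPACITY:
        return Leaf(list(points))
    dim = depth % k  # alternate the split dimension level by level
    ordered = sorted(points, key=lambda p: p[dim])
    value = ordered[len(ordered) // 2][dim]
    left = [p for p in ordered if p[dim] < value]
    right = [p for p in ordered if p[dim] >= value]
    if not left:  # cannot separate the points on this dimension
        return Leaf(ordered)
    return Internal(dim, value,
                    build(left, depth + 1, k), build(right, depth + 1, k))


def range_search(node, low, high):
    """Return points p with low[i] <= p[i] <= high[i] in every dimension."""
    if isinstance(node, Leaf):
        return [p for p in node.points
                if all(l <= x <= h for x, l, h in zip(p, low, high))]
    found = []
    if low[node.dim] < node.value:      # query reaches the a < V side
        found += range_search(node.left, low, high)
    if high[node.dim] >= node.value:    # query reaches the a >= V side
        found += range_search(node.right, low, high)
    return found


pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(pts)
result = range_search(tree, (4, 2), (8, 5))
```

Each split only divides the region of its own subtree, which is what distinguishes it from a grid file divider that cuts across the whole space.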

Kd-Trees Example [Figure: a kd-tree with split values x1, x2, y1, y2 and blocks labelled A to F, shown next to the corresponding partition of the plane.]

Kd-Trees Example [Figure: a larger kd-tree over 21 numbered points, with split values x1 to x9 and y1 to y9 and the matching partition of the plane.]

kDB-Trees Example [Figure: the same 21 points and split values, with the tree nodes grouped into labelled pages.]

Region Quadtree [Figure: a square recursively divided into NW, NE, SW and SE quadrants, with regions numbered 1 to 19, next to the corresponding tree with root A and internal nodes B to F.]

Point Quad-tree [Figure: the square from (0,0) to (100,100) holding the points (50,50), (75,75), (25,25), (75,25), (20,88), (88,65), (52,15) and (92,1); each point divides its region into four quadrants.]
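A point quadtree can be sketched with one node per stored point and up to four children per node. The insertion order below simply follows the order in which the coordinates are listed, which is an assumption; the class name, the function names, and the convention that ties go to the N and E sides are also illustrative.

```python
class QuadNode:
    def __init__(self, point):
        self.point = point
        self.children = {}  # "NW" / "NE" / "SW" / "SE" -> QuadNode


def quadrant(center, p):
    """Quadrant of p relative to center (ties go to N and E by assumption)."""
    ns = "N" if p[1] >= center[1] else "S"
    ew = "E" if p[0] >= center[0] else "W"
    return ns + ew


def insert(root, p):
    if root is None:
        return QuadNode(p)
    node = root
    while True:
        q = quadrant(node.point, p)
        if q not in node.children:
            node.children[q] = QuadNode(p)
            return root
        node = node.children[q]


def contains(root, p):
    node = root
    while node is not None:
        if node.point == p:
            return True
        node = node.children.get(quadrant(node.point, p))
    return False


root = None
for pt in [(50, 50), (75, 75), (25, 25), (75, 25),
           (20, 88), (88, 65), (52, 15), (92, 1)]:
    root = insert(root, pt)
```

With this insertion order, (50,50) becomes the root, its four quadrants hold (20,88), (75,75), (25,25) and (75,25), and the remaining points fall one level deeper.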

The R-Tree • The R-tree is a tree-structured index that remains balanced on inserts and deletes. • Each key stored in a leaf entry is intuitively a box, or collection of intervals, with one interval per dimension. • Example in 2-D: [Figure: boxes in the X-Y plane, from the root of the R-tree down to the leaf level.]

R-Tree Properties • Leaf entry = < n-dimensional box, rid > • Key value is a box. • Box is the tightest bounding box for a data object. • Non-leaf entry = < n-dim box, ptr to child node > • Box covers all boxes in child node (in fact, subtree). • All leaves at same distance from root. • Nodes can be kept at least 50% full (except root). • Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.
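The two kinds of box in these properties can be computed directly. This sketch assumes 2-D data, represents a box as (x_lo, y_lo, x_hi, y_hi), and uses illustrative function names: `bounding_box` gives the tightest box for an object described by its vertices, and `cover_all` gives the box a non-leaf entry would store for a child node.

```python
def bounding_box(vertices):
    """Tightest axis-aligned box around an object given by its vertices."""
    xs, ys = zip(*vertices)
    return (min(xs), min(ys), max(xs), max(ys))


def cover_all(boxes):
    """Smallest box covering every box in a node (stored in the parent entry)."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def covers(outer, inner):
    """True if outer fully contains inner."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


triangle_box = bounding_box([(1, 2), (4, 0), (3, 5)])
node_box = cover_all([triangle_box, (6, 1, 7, 3)])
```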

Example of an R-Tree [Figure: spatial objects approximated by bounding boxes R8 to R19 (leaf entries), grouped into boxes R3 to R7, which are in turn grouped into R1 and R2 (index entries).]

Example R-Tree (Contd.) [Figure: the tree for the same boxes, with R1 and R2 at the root, R3 to R7 below them, and R8 to R19 at the leaf level.]

Search for Objects Overlapping Box Q • Start at root. • 1. If the current node is a non-leaf, for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr. • 2. If the current node is a leaf, for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q. • Note: May have to search several subtrees at each node! (In contrast, a B-tree equality search goes to just one leaf.)
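The two search steps map onto a short recursive function. This is a sketch assuming 2-D boxes stored as (x_lo, y_lo, x_hi, y_hi), with boxes that merely touch counted as overlapping; the `Node` class and the hand-built two-level tree are illustrative.

```python
from dataclasses import dataclass


def overlaps(a, b):
    """True if boxes a and b intersect (touching counts as overlap here)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class Node:
    is_leaf: bool
    entries: list  # leaf: (box, rid) pairs; non-leaf: (box, child Node) pairs


def search(node, q, out=None):
    """Collect the rids of all leaf entries whose box overlaps q."""
    if out is None:
        out = []
    for box, ref in node.entries:
        if overlaps(box, q):
            if node.is_leaf:
                out.append(ref)       # candidate: the object might overlap q
            else:
                search(ref, q, out)   # descend into every overlapping subtree
    return out


leaf1 = Node(True, [((0, 0, 2, 2), "a"), ((1, 1, 3, 3), "b")])
leaf2 = Node(True, [((5, 5, 7, 7), "c"), ((6, 0, 8, 2), "d")])
root = Node(False, [((0, 0, 3, 3), leaf1), ((5, 0, 8, 7), leaf2)])
candidates = search(root, (2.5, 2.5, 5.5, 5.5))
```

The query box here overlaps both root entries, so both subtrees are visited, which is the behaviour the note warns about.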

Improving Search Using Constraints • It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly. • But why not use convex polygons to approximate query regions more accurately? • Will reduce overlap with nodes in tree, and reduce the number of nodes fetched by avoiding some branches altogether. • Cost of overlap test is higher than bounding box intersection, but it is a main-memory cost, and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr> • Start at root and go down to “best-fit” leaf L. • Go to the child whose box needs least enlargement to cover B; resolve ties by going to the smallest-area child. • If best-fit leaf L has space, insert entry and stop. Otherwise, split L into L1 and L2. • Adjust entry for L in its parent so that the box now covers (only) L1. • Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)
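The "least enlargement, then smallest area" rule for choosing a child can be sketched as follows, assuming 2-D boxes stored as (x_lo, y_lo, x_hi, y_hi). The function names are illustrative, and the sketch covers only the choice of child at one node, not the full insertion.

```python
def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])


def union(a, b):
    """Smallest box covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))


def enlargement(box, new_box):
    """Extra area box would need in order to cover new_box."""
    return area(union(box, new_box)) - area(box)


def choose_child(child_boxes, new_box):
    """Index of the child needing least enlargement; ties go to smaller area."""
    return min(range(len(child_boxes)),
               key=lambda i: (enlargement(child_boxes[i], new_box),
                              area(child_boxes[i])))
```

Applying `choose_child` at each non-leaf node on the way down yields the best-fit leaf L.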

Splitting a Node During Insertion • The entries in node L plus the newly inserted entry must be distributed between L1 and L2. • Goal is to reduce likelihood of both L1 and L2 being searched on subsequent queries. • Idea: Redistribute so as to minimize area of L1 plus area of L2. • Exhaustive algorithm is too slow; quadratic and linear heuristics are described in the paper. [Figure: two ways of splitting the same entries, one labelled GOOD SPLIT! and the other labelled BAD!]
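The objective "minimize area of L1 plus area of L2" can be made concrete with the exhaustive search that the slide calls too slow. The sketch below is only practical for a handful of entries and is not the quadratic or linear heuristic; the function names and the `min_fill` parameter are illustrative.

```python
from itertools import combinations


def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])


def mbr(boxes):
    """Smallest box covering all the given boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def best_split(boxes, min_fill=1):
    """Try every two-way split; return (cost, group1, group2) with least cost."""
    n = len(boxes)
    best = None
    for size in range(min_fill, n - min_fill + 1):
        for group in combinations(range(n), size):
            if 0 not in group:
                continue  # keep box 0 in group1 so mirrored splits are skipped
            g1 = [boxes[i] for i in group]
            g2 = [boxes[i] for i in range(n) if i not in group]
            cost = area(mbr(g1)) + area(mbr(g2))
            if best is None or cost < best[0]:
                best = (cost, g1, g2)
    return best


boxes = [(0, 0, 1, 1), (1, 0, 2, 1), (8, 8, 9, 9), (9, 8, 10, 9)]
cost, l1, l2 = best_split(boxes)
bad_cost = area(mbr([boxes[0], boxes[2]])) + area(mbr([boxes[1], boxes[3]]))
```

Grouping the two nearby boxes in each corner gives two small covering boxes, whereas pairing boxes from opposite corners produces two large covering boxes that nearly coincide, so most queries would have to search both.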

R-Tree Variants • The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting: • Remove some (say, 30% of the) entries and reinsert them into the tree. • Could result in all reinserted entries fitting on some existing pages, avoiding a split. • R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion. • Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary. • Searches now take a single path to a leaf, at cost of redundancy.

GiST • The Generalized Search Tree (GiST) abstracts the “tree” nature of a class of indexes including B+ trees and R-tree variants. • Striking similarities in insert/delete/search and even concurrency control algorithms make it possible to provide “templates” for these algorithms that can be customized to obtain the many different tree index structures. • B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs. • GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Comments on R-Trees • Deletion consists of searching for the entry to be deleted, removing it, and if the node becomes under-full, deleting the node and then re-inserting the remaining entries. • Overall, works quite well for 2D and 3D datasets. Several variants (notably, R+ and R* trees) have been proposed; widely used. • Can improve search performance by using a convex polygon to approximate query shape (instead of a bounding box) and testing for polygon-box intersection.

Bitmap Index • Bitmap index: a specialized index that takes advantage of: • Read-mostly data: data produced from scientific experiments can be appended in large groups • Fast operations • “Predicate queries” can be performed with bitwise logical operations • Predicate ops: =, <, >, <=, >=, range • Logical ops: AND, OR, XOR, NOT • They are well supported by hardware • Easy to compress, potentially small index size • Each individual bitmap is small and frequently used ones can be cached in memory
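Predicate evaluation with bitwise operations can be sketched using Python integers as uncompressed bitmaps, one per distinct column value. The sample rows and the helper names are illustrative; production bitmap indexes compress the bitmaps, which this sketch does not attempt.

```python
rows = [                # hypothetical (dept, sal) records; row id = position
    ("toy", "10k"),
    ("sales", "10k"),
    ("art", "30k"),
    ("sales", "40k"),
    ("sales", "30k"),
]


def build_bitmaps(values):
    """One bitmap per distinct value; bit i is set if row i has that value."""
    bitmaps = {}
    for row_id, v in enumerate(values):
        bitmaps[v] = bitmaps.get(v, 0) | (1 << row_id)
    return bitmaps


def row_ids(bitmap):
    """Row ids whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


dept = build_bitmaps(d for d, _ in rows)
sal = build_bitmaps(s for _, s in rows)
all_rows = (1 << len(rows)) - 1  # mask needed to make NOT well defined

sales_and_30k = row_ids(dept["sales"] & sal["30k"])
sales_30k_or_40k = row_ids(dept["sales"] & (sal["30k"] | sal["40k"]))
not_sales = row_ids(all_rows & ~dept["sales"])
```

A range predicate becomes an OR over the bitmaps of the values in the range, and a conjunction across columns becomes an AND of the per-column results.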
