Indexing Multidimensional Feature Spaces

Indexing Multidimensional Feature Spaces Overview of Multidimensional Index Structure Hybrid Tree, Chakrabarti et. al. ICDE 1999 Local Dimensionality Reduction, Chakrabarti et. al. VLDB 2000

Queries over Feature Spaces • Consider a d-dimensional feature space • color histogram, texture, … • Nature of Queries • range queries: objects that reside within the region specified in the query • K-nearest neighbor queries: objects that are closest to a query object based on a distance metric • Approx. nearest neighbor queries: retrieved object is within (1+ epsilon) of the real nearest neighbor. • All-pair (similarity join) queries: retrieve all pairs of objects within a epsilon threshold. • A search algorithm may include: • false positives: objects that do not meet the query condition, but are retrieved anyway. We tend to minimize false positives • false negatives: objects that meet the query condition but are not returned. Usually, approaches avoid false negatives

Approach: Utilize Single Dimensional Index • Index on attributes independently • Project query range to each attribute determine pointers. • Intersect pointers • go to the database and retrieve objects in the intersection. May result in very high I/O cost

Multiple Key Index • Index on one attribute provides pointers to an index on the other • Cannot support partial match queries on second attribute • performance of range search not much better compared to independent attribute approach • the secondary indices may be of different sizes -- specifically some of them may be very small Index on first attribute Index on second attribute

R-tree Data Structure • Extension of B-tree to multidimensional space. • Paginated, balanced, guaranteed storage utilization. • Can support both point data and data with spatial extent • Groups objects into possibly overlapping clusters (rectangles in our case) • Search for range query proceeds along all paths that overlap with the query.

R-tree Insert Object E • Step I1 • Chooseleaf L to Insert E /* find position to insert*/ • Step I2 • If L has room install E • Else SplitNode(L) • Step I3: • Adjust Tree /* propagate Changes*/ • Step I4: • if node split propagates to root adjust height of tree

ChooseLeaf • Step CL1: • Set N to be root • Step CL2: • If N is a leaf return N • Step CL3: • If N is not a root, let F be an entry whose rectangle needs least enlargement to include object • break ties by choosing smaller rectangle • Step CL4 • Set N to be child node pointed by entry F • goto Step CL2

Split Node • Given a node split it into two nodes which are each atleast half full • Multiple Objectives: • minimize overlap • minimize covered area • R-tree minimizes covered area • What is an optimal criteria??? Minimize covered area Minimize overlap

Minimizing Covered Area • Group objects into 2 parts such that the covered area is minimized • NP Hard!! • Hence use heuritics • Two heuristics explored • quadratic and linear

Basic Split Strategy • /* Divide the set of M+1 entries into 2 groups G1 and G2 */ • PickSeeds for G1 and G2 • Invoke PickNext to assign an object to a group recursively until either all objects assigned or one of the groups becomes half full. • If one group gets half full assign rest of the objects to the other group.

Quadratic Split • PickSeed: • for each pair of entries E1 and E2 compose a rectangle J including E1.rect and E2.rect • let d = area(J) - area(E1.rect) - area(E2.rect) /* d is wasted space */ • Choose the most wasteful pair with largest d as seeds for groups G1 and G2. • PickNext /*select next entry to put in a group */ • Determine cost of putting each entry in the group G1 and G2 • for each unassigned entry calculate • d1 = area increase required in the covering rectangle in Group G1 to include the entry • d2= area increase required in the covering rectangle in Group G2 to include the entry. • Select entry with greatest preference for a group • choose any entry with the maximum difference between d1 and d2

Linear Split • PickSeed • find extreme rectangles along each dimension • find entries with the highest low side and the lowest high side • record the separation • Normalize the separation by width of extent along the dimension • Choose as seeds the pair that has the greatest normalized distance along any dimension • PickNext • randomly choose entry to assign

R-tree Search (Range Search on range S) • Start from root • If node T is not leaf • check entries E in T to determine if E.rectangle overlaps S • for all overlapping entries invoke search recursively • If T is leaf • check each entry to see if it entry satisfies range query

R-tree Delete • Step D1 • find the object and delete entry • Step D2 • Condense Tree • Step D3 • if root has 1 node shorten tree height

Condense Tree • If node is underful • delete entry from parent and add to a set Q • Adjust bounding rectangle of parent • Do the above recursively for all levels • Reinsert all the orphaned entries • insert entries at the same level they were deleted.

Other Multidimensional Data Structures • Many generalizations of R-tree • different splitting criteria • different shapes of clusters (e.g., d-dimensional spheres) • adding redundancy to reduce search cost: • store objects in multiple rectangles instead of a single rectangle to reduce cost of retrieval. But now insert has to store objects in many clusters. This strategy also increases overlap causing search performance to detoriate. • Space Partitioning Data Structures • unlike R-tree which group objects into possibly overlapping clusters, these methods attempt to partition space into non-overlapping regions. • E.g., KD tree, quad tree, grid files, KD-Btree, HB-tree, hybrid tree. • Space filling curves • superimpose an ordering on multidimensional space that preserves proximity in multidimensional space. (Z-ordering, hilbert ordering) • Use a B-tree as an index on that ordering

KD-tree • A main memory data structure based on binary search trees • can be adapted to block model of storage (KD-Btree) • Levels rotate among the dimensions, partitioning the space based on a value for that dimension • KD-tree is not necessarily balanced.

X=7 X=3 X=5 X=8 KD-Tree Example X=5 y=6 y=5 Y=6 x=8 x=7 x=3 Y=2 y=2

KD-Tree Operations • Search: • straightforward…. Just descend down the tree like binary search trees. • Insertion: • lookup record to be inserted, reaching the appropriate leaf. • If room on leaf, insert in the leaf block • Else, find a suitable value for the appropriate dimension and split the leaf block

Adapting KD Tree to Block Model • Similar to B-tree, tree nodes split many ways instead of two ways • Risk: • insertion becomes quite complex and expensive. • No storage utilization guarantee since when a higher level node splits, the split has to be propagated all the way to leaf level resulting in many empty blocks. • Pack many interior nodes (forming a subtree) into a block. • Risk • it may not be feasible to group nodes at lower level into a block productively. • Many interesting papers on how to optimally pack nodes into blocks recently published.

Quad Tree • Nodes split along all dimensions simultaneously • Division fixed: by quadrants • As with KD-tree we cannot make quadtree levels uniform

Quad Tree Example X=7 X=3 SW NW SE NE X=5 X=8

Quad Tree Operations • Insert: • Find Leaf node to which point belongs • If room, put it there • Else, make the leaf an interior node and give it leaves for each quadrant. Split the points among the new leaves. • Search: • straighforward… just descend down the right subtree

Grid Files • Space Partitioning strategy but different from a tree. • Select dividers along each dimension. Partition space into cells • Unlike KD-tree dividers cut all the way. • Each cell corresponds to 1 disk page. • Many cells can point to the same page. • Cell directory potentially exponential in the number of dimensions

Grid File Implementation • Maintain linear scales for each dimension that contain split positions for the dimension • Cell directory implemented as a multidimensional array. • /* can be large and may not fit in memory */

Grid File Search • Exact Match Search: at most 2 I/Os assuming linear scales fit in memory. • First use liner scales to determine the index into the cell directory • access the cell directory to retrieve the bucket address (may cause 1 I/O if cell directory does not fit in memory) • access the appropriate bucket (1 I/O) • Range Queries: • use linear scales to determine the index into the cell directory. • Access the cell directory to retrieve the bucket addresses of buckets to visit. • Access the buckets.

Grid File Insert • Determine the bucket into which insertion must occur. • If space in bucket, insert. • Else, split bucket • how to choose a good dimension to split? • If bucket split causes a cell directory to split do so and adjust linear scales. • /* notice that cell directory split results in p^(d-1) new entries to be created in cell directory */ • insertion of these new entries potentially requires a complete reorganization of the cell directory--- expensive!!!

Grid File Insert • Inserting a new split position will require the cell directory to increase by 1 column. In d-dim space, it will cause p^(d-1) new entries to be created

Space Filling Curve • Assumption • finite precision in representing each coordinate. B A Z(A) = shuffle(x_A, y_A) = shuffle(00,11) = 0101 = 5 Z(B) = 11 = 3 (common prefix to all its blocks) Z(C1) = 0010 = 2 Z(C2) = 1000 = 8 00 01 10 11 00 01 10 11 C

Deriving Z-Values for a Region • Obtain a quad-tree decomposition of an object by recursively dividing it into blocks until blocks are homogeneous. 01 11 Objects representation is 0001, 0011,01 00 10 00 11 01 00 11

Disk Based Storage • For disk storage, represent object based on its Z-value • Use a B-tree index. • Range Query: • translate query range to Z values • search B-tree with Z-values of data regions for matches

Nearest Neighbor Search • Retrieve the nearest neighbor of query point Q • Simple Strategy: • convert the nearest neighbor search to range search. • Guess a range around Q that contains at least one object say O • if the current guess does not include any answers, increase range size until an object found. • Compute distance d’ between Q and O • re-execute the range query with the distance d’ around Q. • Compute distance of Q from each retrieved object. The object at minimum distance is the nearest neighbor!!! Why? • Issues: how to guess range, the retrieval may be sub-optimal if incorrect range guessed. Becomes a problem in high dimensional spaces.

Nearest Neighbor Search using Range Searches Distance between Q and A b Initial range search Q A Revised range search A optimal strategy that results in minimum number of I/Os possible using priority queues.

Alternative Strategy to Evaluating K-NN Mindist(Q,B) • Let Q be the query point. • Traverse nodes in the data structure in the order of MINDIST(Q,N), where • MINDIST(Q,N) = dist(Q,N), if N is an object. • MINDIST(Q,N) = minimum distance between Q and any object in N, if N is an interior node. B Q A Mindist(Q, A) Mindist(Q,C) C

MINDIST Between Rectangle and Point Q Q T Q S

Generalized Search Trees • Motivation: • disparate applications require different data structures and access methods. • Requires separate code for each data structure to be integrated with the database code • too much effort. • Vendors will not spend time and energy unless application very important or data structure has general applicability. • Generalized search trees abstract the notion of data structure into a template. • Basic observation: most data structures are similar and a lot of book keeping and implementation details are the same. • Different data structures can be seen as refinements of basic GiST structure. Refinements specified by providing a registering a bunch of functions per data structure to the GiST.

GiST supports extensibility both in terms of data types and queries • GiST is like a “template” - it defines its interface in terms of ADT rather than physical elements (like nodes, pointers etc.) • The access method (AM) can customize GiST by defining his or her own ADT class i.e. you just define the ADT class, you have your access method implemented! • No concern about search/insertion/deletion, structural modifications like node splits etc.

x=5 x+y=12 y=5 y=4 y=3 x=3 x=4 x=6 Integrating Multidimensional Index Structures as AMs in DBMS Generalized Search Trees (GiSTs) x>5 and y>4 x>4 and y£3 x£3 x£6 x+y £12 x+y >12 y£5 x>6 y>5 Data nodes containing points

Problems with Existing Approaches for Feature Indexing • Very high dimensionality of feature spaces -- e.g., shape may define a 100-D space. • Traditional multidim. data structures perform worse than linear scan at such high dimensionality. (dimensionality curse) • Arbitrary distance functions-- e.g., distance functions may change across iterations of relevance feedback. • Traditional multidim. data structures support a fixed distance measure -- usually euclidean (L2) or Lmax. • No support for Multi-point Queries -- as in query expansion. • Executing K-NN for each query point and merging results to generate K-NN for multi-point query is very expensive. • No Support for Refinement • query in the following iterations do not diverge greatly from query in previous iterations. Effort spent in previous iterations should be exploited for evaluating K-NN in future iterations

High Dimensional Feature Indexing Multidim. Data Structures • design data structures that scale to high dim. spaces • Existing proposals perform worse than linear scan over >= 10 dim. Spaces [Weber, et al., VLDB 98] • Fundamental Limitation dimensionality beyond which linear scan wins over indexing! (approx. 610) Dimensionality Reduction • transform points in high dim. space to low dim. space • works well when data correlated into a few dimensions only • difficult to manage in dynamic environments

Data Partitioning (DP) Bounding Region (BR) Based e.g., R-tree, X-tree, SS-tree, SR-tree, M-tree All k dim. used to represent partitioning Poor scalability to dimensionality due to high degree of overlap and low fanout at high dimensions seq. scan wins for > 10D Space Partitioning(SP) Based on disjoint partitioning of space e.g., KDB-tree, hB-tree, LSDh-tree, VP tree, MVP tree no overlap and fanout independent of dimensions Poor scalability to dimensionality due to either poor storage utilization or redundant information storage requirements. Classification of Multidimensional Index Structures

D B Dim2 3 2 A C 0,0 3 Dim1 Data Points Hybrid Tree: Space Partitioning (SP) instead of Data Partitioning (DP) R1 R2 R3 R4 R1 R2 R3 R4 dim=1 pos=3 Non-leaf nodes of hybrid tree organized as kd-tree dim=2 pos=2 dim=2 pos=3 Data Points Data Points Data Points A B C D

Splitting of Non-Leaf Nodes (Easy case) F F B C B C 4 3 Clean split possible without violating node utilization D E D E A A (0,0) 2 4 6 dim=1 pos=4 dim=1 pos=4 dim=2 pos=3 dim=2 pos=4 dim=2 pos=3 dim=2 pos=4 A dim=1 pos=2 F dim=1 pos=6 A dim=1 pos=2 F dim=1 pos=6 D B C E D B C E

dim=1 pos=4 dim=2 pos=3 dim=2 pos=4 dim=1 pos=2 dim=1 pos=6 dim=2 pos=5 A dim=1 pos=7 dim=2 pos=2 B C D G E F I H Splitting of Non-Leaf Nodes (Difficult case) Clean split not possible without violating node util. Always clean split; Downward cascading splits (empty nodes) Complex splits (space overhead; tree becomes large) Allow Overlap (avoid by relaxing node util, otherwise minimize overlap) (Hybrid Tree)

H I H I B C G B C G F F D D A A E E Splitting of Non-Leaf Nodes (Difficult case) 5 4 dim=2 pos=3,4 3 2 (0,0) 2 4 6 7 dim=1 pos=4,4 dim=1 pos=4,4 dim=1 pos=4,4 dim=2 pos=3,3 dim=2 pos=4,4 dim=1 pos=6,6 dim=1 pos=2,2 dim=2 pos=5,5 A dim=1 pos=2,2 dim=1 pos=6,6 dim=2 pos=5,5 A dim=2 pos=2,2 dim=1 pos=7,7 D B C G dim=1 pos=7,7 dim=2 pos=2,2 B C D G E F H I E F H I

Choosing Split dimension and position: EDA (Expected Disk Accesses) Analysis Consider a range (cube) query, side length r along each dimension Split node along this Node BR Node BR expanded by (r/2) on each side along each dimension (Minkowski Sum) r w w+r Prob. of range query accessing node (assuming (0,1) space and uniform query distribution) Prob. of range query accessing both nodes after split (increase in EDA) Choose split dimension and position that minimizes increase in EDA

Choosing Split dimension and position(based on EDA analysis) • Data Node Splitting • Spilt dimension: split along maximum spread dimension • Split position: split as close to the middle as possible (without violating node utilization) • Index Node Splitting: • Split dimension: argminjò P(r) (wj + r)/ (sj + r) dr • depends of the distribution of the query size • argminj (wj + R)/ (sj + R)when all queries are cubes with side length R • Split position: avoid overlap if possible, else minimize as much overlap as possible without violating utilization constraints

Dead Space Elimination R1 R2 Space Partitioning (Hybrid tree): Without dead space elimination D B R3 R4 A C Data Partitioning (R-tree): No dead space Space Partitioning (Hybrid tree): With dead space elimination D B A C

Dead Space Elimination • Live space encoding using 3 bit precision (ELSPRECISION=3) • Encoded Live Space (ELS) BR = (001,001,101,111) • Bits required = 2*numdims*ELSPRECISION • Compression = ELSPRECISION/32 • Only applied to leaf nodes 111 110 101 100 011 010 001 000 000 001 010 011 100 101 110 111

Tree operations • Search: • Point, Range, NN-search, distance-based search as in DP-techniques • Reason: BR representation can be derived from kd-tree representation • Exploit tree organization (pruning) for fast intra-node search • Insertion: • recursively choose space partition that contains the point • break tries arbitrarily • no volume computation (otherwise floating point exception at high dims) • Deletion: • details in thesis

Indexing Multidimensional Feature Spaces