M-Tree: An Efficient Access Method for Similarity Search in Metric Space

Presenters: Amool Gupta, Amit Sharma

MOTIVATION • What basic problem does it address, and why? • What other techniques solve the same problem, and how is this one a step ahead? • Basic fundamentals of this indexing structure

Similarity Search Problem

Similarity Searching • Effectiveness: how well the similarity measure models human perception • Efficiency: how the required performance is achieved over huge volumes of data, i.e. the index structure

Examples of Distance Functions • Lp metrics (for vectors) • L1: Manhattan distance • L2: Euclidean distance • L∞: maximum (Chebyshev) distance • Edit distance (for strings) • Hausdorff distance • Earth mover's distance • Quadratic form distance
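Two of the listed distance functions can be sketched in a few lines. This is only an illustration; the function names `minkowski` and `edit_distance` are chosen here and do not come from the slides, and the edit distance assumes unit costs for insertion, deletion and substitution.

```python
def minkowski(x, y, p):
    """L_p distance between two equal-length vectors.

    p = 1 gives the Manhattan distance, p = 2 the Euclidean distance,
    and p = float("inf") the L-infinity (maximum) distance.
    """
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == float("inf"):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def edit_distance(s, t):
    """Levenshtein distance between two strings with unit edit costs."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

Both functions satisfy the metric axioms, which is all a metric index needs to know about them.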

Metric Spaces: An Abstraction of Similarity • A metric space M = (D,d) is a pair, where • D is a domain (“universe”) of values, and • d is a distance function that, ∀ x,y,z ∈ D, satisfies the metric axioms: • d(x,y) ≥ 0, d(x,y) = 0 ⇔ x = y (positivity) • d(x,y) = d(y,x) (symmetry) • d(x,y) ≤ d(x,z) + d(z,y) (triangle inequality) • All the distance functions seen in the previous examples are metrics, and so are the (weighted) Lp-norms • The only distance seen so far that does not fit the metric framework is the DTW • Metric indexes use only the metric axioms to organize objects, and exploit the triangle inequality to prune the search space

Limitations of SAMs • Spatial access methods (SAMs) are limited to indexing DB objects represented by feature values in a multi-dimensional vector space (we need a more generic indexing strategy) • Dissimilarity of objects is measured by the Lp distance between feature values • They assume that distance computation is trivial • Limitations of Metric Trees • They do not support a dynamic database environment • They reduce distance computations but pay no attention to I/O costs

What is a relative distance? • OA + AB = OB, so AB = OB - OA • AB = relative position of B w.r.t. A (Figure: points O, A and B)

M-Tree • The key idea is to reduce the number of distance computations and, at the same time, reduce I/O. • The M-tree partitions objects on the basis of their relative distances, as measured by a specific distance function, and stores these objects in nodes. (Figure: root, parent object P(Or), routing object Or with covering radius r(Or))

M-Tree Structure • Leaf nodes: store all indexed DB objects by their keys or feature values. • Internal nodes: called routing nodes. Each routing object Or is associated with • Or = feature value of the routing object • ptr(T(Or)) = pointer to the root of the subtree T(Or) • r(Or) = covering radius, i.e. the maximum distance of any object in the subtree T(Or) from the routing object Or • d(Or, P(Or)) = distance of the routing object from its parent object P(Or)

M-Tree Structure • Leaf node: entry for a database object • Oj = feature value of the DB object • oid(Oj) = object key • d(Oj, P(Oj)) = distance of Oj from its parent P(Oj)
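The two entry formats can be written down as plain record types. The sketch below is one possible in-memory layout, not the layout used by the authors; the class and field names (`RoutingEntry`, `LeafEntry`, `Node`, `dist_to_parent`, ...) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class RoutingEntry:
    obj: Any               # Or: feature value of the routing object
    radius: float          # r(Or): covering radius of the subtree T(Or)
    dist_to_parent: float  # d(Or, P(Or)): precomputed distance to the parent
    subtree: "Node"        # ptr(T(Or)): root of the covered subtree


@dataclass
class LeafEntry:
    obj: Any               # Oj: feature value of the database object
    oid: Any               # oid(Oj): object key
    dist_to_parent: float  # d(Oj, P(Oj)): precomputed distance to the parent


@dataclass
class Node:
    is_leaf: bool
    entries: List[Any] = field(default_factory=list)  # LeafEntry or RoutingEntry
```

Storing `dist_to_parent` in every entry is what later lets a search skip distance computations.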

Processing Queries • SAMs generally try to prune the tree for a given query; the main emphasis is on efficient pruning that reduces the number of disk accesses. Once the tree is pruned, however, the distance from the query point Q to every object in the remaining part of the tree must still be computed. • The M-tree, in contrast, aims both to prune the tree and to reduce the number of distance computations, which it achieves by making maximum use of the precomputed distances stored in its nodes.

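Range Query • Given a query point Q and a maximum search distance r(Q), the range query range(Q, r(Q)) returns all objects Oj such that d(Oj, Q) ≤ r(Q) • A routing object Or is of interest only if its region intersects the query region, i.e. if d(Or, Q) ≤ r(Q) + r(Or) • How to detect intersection using precomputed distances? By the triangle inequality, |d(P(Or), Q) - d(Or, P(Or))| ≤ d(Or, Q), so if |d(P(Or), Q) - d(Or, P(Or))| > r(Q) + r(Or) the regions cannot intersect and the subtree is discarded without computing d(Or, Q) (Figure: query region of radius r(Q) around Q, root, parent object P(Or), routing object Or with covering radius r(Or))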
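The intersection test and its cheaper variant based on stored distances can be sketched as two small predicates. The function and parameter names are illustrative only; `d_parent_q` stands for d(P(Or), Q), which is already known when the parent node was visited, and `d_or_parent` for the stored d(Or, P(Or)).

```python
def can_prune_without_distance(d_parent_q, d_or_parent, r_q, r_or):
    """True if the subtree of Or can be skipped using stored distances only.

    |d(P(Or), Q) - d(Or, P(Or))| is a lower bound on d(Or, Q) by the
    triangle inequality, so exceeding r(Q) + r(Or) rules out intersection.
    """
    return abs(d_parent_q - d_or_parent) > r_q + r_or


def regions_intersect(d_or_q, r_q, r_or):
    """Exact test, used only when the cheap test above could not prune."""
    return d_or_q <= r_q + r_or
```

Only when the first predicate returns False does d(Or, Q) actually have to be computed.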

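Range Query: Leaf Node • An object Oj in a leaf node is a solution to the range query if it lies within the query radius, i.e. d(Oj, Q) ≤ r(Q) • The relative distance can again be used to find out whether the object can lie within the query radius or not: if |d(P(Oj), Q) - d(Oj, P(Oj))| > r(Q), Oj cannot qualify and d(Oj, Q) need not be computed (Figure: query region of radius r(Q) around Q, root, parent object, leaf object Oj)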

Algorithm for Range Queries
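Only the heading of this slide survives, so the following is a hedged sketch of a recursive range search built from the pruning idea of the previous slides, not a reproduction of the algorithm shown in the talk. The dictionary layout (`"leaf"`, `"entries"`, `"obj"`, `"oid"`, `"radius"`, `"d_parent"`, `"child"`) is an assumption made to keep the example self-contained.

```python
def range_search(node, q, r_q, dist, d_parent_q=None, out=None):
    """Collect the oids of all objects within distance r_q of q.

    node: {"leaf": bool, "entries": [...]}
    routing entry: {"obj", "radius", "d_parent", "child"}
    leaf entry:    {"obj", "oid", "d_parent"}
    d_parent_q:    d(P, Q) for the parent object of this node (None at the root)
    """
    if out is None:
        out = []
    for e in node["entries"]:
        r_e = 0.0 if node["leaf"] else e["radius"]
        # Cheap test on stored distances; skipped at the root, which has no parent.
        if d_parent_q is not None and abs(d_parent_q - e["d_parent"]) > r_q + r_e:
            continue
        d = dist(e["obj"], q)  # the only place a real distance is computed
        if d <= r_q + r_e:
            if node["leaf"]:
                out.append(e["oid"])
            else:
                range_search(e["child"], q, r_q, dist, d, out)
    return out
```

Leaf entries are handled by the same code path as routing entries with a covering radius of zero, which keeps the two cases of the previous slides in one loop.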

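k Nearest Neighbors Queries • Given a query point Q and an integer k ≥ 1 • The k-NN query NN(Q, k) returns the k indexed objects that have the shortest distance to Q (Figure: query point Q, with the minimum and maximum bounds on the distance from Q to the region of routing object Or with covering radius r(Or); root and parent object P(Or))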
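The "Min Bound" and "Max Bound" labels in the figure presumably refer to the smallest and largest possible distance from Q to any object inside the region of Or. A minimal sketch under that assumption follows; the helper names `d_min`, `d_max` and `order_subtrees` are illustrative, and visiting subtrees in increasing order of the lower bound is one common strategy rather than a detail taken from the slide.

```python
import heapq


def d_min(d_or_q, r_or):
    """Lower bound on d(Oj, Q) for any Oj in the region of Or."""
    return max(d_or_q - r_or, 0.0)


def d_max(d_or_q, r_or):
    """Upper bound on d(Oj, Q) for any Oj in the region of Or."""
    return d_or_q + r_or


def order_subtrees(candidates):
    """Return subtree names by increasing lower bound.

    candidates: list of (name, d(Or, Q), r(Or)) triples.
    """
    heap = [(d_min(d, r), name) for name, d, r in candidates]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

A subtree whose lower bound exceeds the distance of the current k-th nearest candidate cannot contribute to the answer and can be skipped.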

SPLIT MANAGEMENT • The M-tree grows in a bottom-up fashion • Overflow of a node N is managed by creating a new node N’ and splitting the entries between N and N’ • PARTITION: the entries are distributed among N and N’ • PROMOTE: two entries are promoted as routing objects and moved to the parent level

SPLIT MANAGEMENT • If the split node is a leaf, the covering radius of a promoted object, say Op1, is set to r(Op1) = max{d(Oj, Op1) | Oj ∈ N1}, where N1 is the set of entries assigned to Op1 • If overflow occurs in an internal node, r(Op1) = max{d(Or, Op1) + r(Or) | Or ∈ N1}
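The two covering-radius formulas translate directly into code. This is a small sketch with assumed names; `dist` is any metric, and routing entries are represented as `(object, radius)` pairs purely for the example.

```python
def covering_radius_leaf(promoted, objects, dist):
    """r(Op1) = max{ d(Oj, Op1) | Oj in N1 } for a split leaf."""
    return max(dist(o, promoted) for o in objects)


def covering_radius_internal(promoted, routing_entries, dist):
    """r(Op1) = max{ d(Or, Op1) + r(Or) | Or in N1 } for a split internal node.

    routing_entries: iterable of (Or, r(Or)) pairs.
    """
    return max(dist(o, promoted) + r for o, r in routing_entries)
```

The extra r(Or) term in the internal case guarantees that the new region still covers every object stored below the child regions.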

SPLIT POLICIES • A specific implementation of the Promote and Partition methods defines a split policy • An ideal split policy should promote two objects and partition the remaining objects so that the resulting regions have - minimum volume - minimum overlap • How is this different from SAMs?

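PROMOTE: Choosing Routing Objects • m_RAD: minimizes the sum of the two covering radii • mM_RAD: minimizes the maximum of the two radii • M_LB_DIST: maximizes the lower bound on distance, using only the stored distances • RANDOM: selects the routing objects at random • SAMPLING: the RANDOM policy applied over a sample of objects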
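A brute-force sketch of the two radius-based criteria is given below. It tries every pair of objects, assigns each object to the nearer candidate (an assumption made here so that radii can be evaluated), and keeps the pair with the best score. The function name and the `use_max` flag are illustrative, not taken from the slides.

```python
from itertools import combinations


def promote_by_radius(objects, dist, use_max=False):
    """Pick two routing objects by exhaustive search over all pairs.

    use_max=False: minimize r1 + r2      (sum-of-radii criterion)
    use_max=True:  minimize max(r1, r2)  (maximum-radius criterion)
    """
    best = None
    for p1, p2 in combinations(objects, 2):
        # Tentative partition: each object goes to the nearer candidate.
        n1 = [o for o in objects if dist(o, p1) <= dist(o, p2)]
        n2 = [o for o in objects if dist(o, p1) > dist(o, p2)]
        r1 = max((dist(o, p1) for o in n1), default=0.0)
        r2 = max((dist(o, p2) for o in n2), default=0.0)
        score = max(r1, r2) if use_max else r1 + r2
        if best is None or score < best[0]:
            best = (score, p1, p2)
    return best[1], best[2]
```

The quadratic number of candidate pairs, each requiring a full partition, illustrates why radius-based promotion costs many distance computations compared with a random choice.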

PARTITIONING: Distribution of Entries • Generalized hyperplane (unbalanced split; why?): assign each object Oj ∈ N to the nearest routing object: • if d(Oj, Op1) ≤ d(Oj, Op2) then • assign Oj to N1, else • assign Oj to N2. • Balanced: compute d(Oj, Op1) and d(Oj, Op2) for all Oj ∈ N. Repeat until N is empty: • assign to N1 the nearest neighbor of Op1 in N and remove it from N; • assign to N2 the nearest neighbor of Op2 in N and remove it from N.
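Both distribution strategies can be sketched as follows. The function names are chosen for this example; `objects` plays the role of N, and `p1`, `p2` are the promoted objects Op1 and Op2.

```python
def partition_hyperplane(objects, p1, p2, dist):
    """Generalized hyperplane: each object goes to the nearer routing object."""
    n1, n2 = [], []
    for o in objects:
        (n1 if dist(o, p1) <= dist(o, p2) else n2).append(o)
    return n1, n2


def partition_balanced(objects, p1, p2, dist):
    """Balanced: alternately hand out the nearest remaining object to p1 and p2."""
    remaining = list(objects)
    n1, n2 = [], []
    while remaining:
        nearest = min(remaining, key=lambda o: dist(o, p1))
        remaining.remove(nearest)
        n1.append(nearest)
        if remaining:  # N may run out after the assignment to N1
            nearest = min(remaining, key=lambda o: dist(o, p2))
            remaining.remove(nearest)
            n2.append(nearest)
    return n1, n2
```

For the objects 0, 1, 2, 3, 10 with p1 = 0 and p2 = 10, the hyperplane strategy puts four objects in N1 and one in N2, whereas the balanced strategy yields sizes three and two at the price of moving object 3 far from its routing object.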

Experimental Results • Assumed a constant node size • Tested all split policies • Results • The balanced partition method was shown to add significant overhead and to increase the I/O cost • The fastest split policy was observed to be RANDOM and the slowest m_RAD • In terms of average volume covered per page (a measure of tree construction quality), M_LB_DIST proved effective

Experimental Results (2)

I/O cost

Avg Volume per page

I/O cost

I/O cost for M-Tree & R*-Tree

Thanks
