
M-Tree: An Efficient Access Method for Similarity Search in Metric Space

M-Tree: An Efficient Access Method for Similarity Search in Metric Space. Presenters: Amool Gupta, Amit Sharma. Motivation: What basic problem does the M-tree address, and why? What other techniques solve the same problem, and how is this one a step ahead? What are the basic fundamentals of this indexing structure?


Presentation Transcript


  1. M-Tree: An Efficient Access Method for Similarity Search in Metric Space • Presenters: Amool Gupta, Amit Sharma

  2. MOTIVATION • What basic problem does it address, and why? • What other techniques solve the same problem, and how is this one a step ahead? • Basic fundamentals of this indexing structure

  3. Similarity Search Problem

  4. Similarity Searching • Effectiveness: how well the formulated similarity measure models human perception • Efficiency: how the required performance is achieved over huge volumes of data, i.e. through an index structure

  5. Examples of Distance Functions • Lp metrics (for vectors) • L1 (Manhattan) distance • L2 (Euclidean) distance • L∞ (maximum) distance • Edit distance (for strings) • Hausdorff distance • Earth mover's distance • Quadratic form distance
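
A few of the distance functions listed above can be sketched in Python as follows. This is a minimal illustration, not code from the presentation; the function names `lp_distance`, `linf_distance`, and `edit_distance` are chosen here for the example, and the edit distance is assumed to be the unit-cost Levenshtein distance.

```python
from typing import Sequence


def lp_distance(x: Sequence[float], y: Sequence[float], p: float) -> float:
    """Lp (Minkowski) distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def linf_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """L-infinity distance: the largest coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```

Any of these can be passed as the distance function d of a metric index, since the index only ever calls d on pairs of objects.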

  6. Metric Spaces: An Abstraction of Similarity • A metric space M = (D, d) is a pair, where • D is a domain (“universe”) of values, and • d is a distance function that, ∀ x, y, z ∈ D, satisfies the metric axioms: • d(x,y) ≥ 0, d(x,y) = 0 ⇔ x = y (positivity) • d(x,y) = d(y,x) (symmetry) • d(x,y) ≤ d(x,z) + d(z,y) (triangle inequality) • All the distance functions in the previous examples are metrics, and so are the (weighted) Lp-norms • A well-known distance that does not fit the metric framework is dynamic time warping (DTW), which violates the triangle inequality • Metric indexes use only the metric axioms to organize objects, and exploit the triangle inequality to prune the search space
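
The axioms above can be checked by brute force on a small sample of objects. The sketch below is only an illustration: the helper name `violated_axiom` is hypothetical, and passing the check on a sample does not prove that a function is a metric on the whole domain; it can only expose a violation.

```python
from itertools import product


def violated_axiom(points, d, tol=1e-9):
    """Return the name of the first metric axiom that d violates on the
    given sample of distinct points, or None if no violation is found."""
    for x, y in product(points, repeat=2):
        if d(x, y) < -tol:
            return "positivity"
        if (d(x, y) <= tol) != (x == y):  # d(x, y) = 0 exactly when x = y
            return "positivity"
        if abs(d(x, y) - d(y, x)) > tol:
            return "symmetry"
    for x, y, z in product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:
            return "triangle inequality"
    return None


sample = [0.0, 1.0, 2.0]
print(violated_axiom(sample, lambda a, b: abs(a - b)))    # absolute difference
print(violated_axiom(sample, lambda a, b: (a - b) ** 2))  # squared difference
```

The squared difference fails because d(0, 2) = 4 exceeds d(0, 1) + d(1, 2) = 2, so a metric index could prune wrongly if it were used as d.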

  7. Limitations of SAMs • Spatial access methods (SAMs) are limited to indexing DB objects represented by feature values in a multi-dimensional vector space (a more generic indexing strategy is needed) • Dissimilarity of objects is measured by an Lp distance between feature values • They assume that distance computation is trivial • Limitations of Metric Trees • They do not support dynamic database environments • They reduce distance computations but pay no attention to I/O costs

  8. What is a relative distance? • OA + AB = OB, so AB = OB − OA • AB = relative position of B w.r.t. A (Figure: points O, A, and B)

  9. M-Tree • The key idea is to reduce the number of distance computations and, at the same time, reduce I/O. • The M-tree partitions objects on the basis of their relative distances, as measured by a specific distance function, and stores these objects in nodes. (Figure: routing object Or with covering radius r(Or), its parent P(Or), and the root)

  10. M-Tree Structure • Leaf nodes: store all indexed DB objects, by their keys or feature values. • Internal nodes: called routing nodes. Each routing object Or is associated with • Or = feature value of the routing object • ptr(T(Or)) = pointer to the root of sub-tree T(Or) • r(Or) = covering radius, i.e. the maximum distance of the objects in sub-tree T(Or) from routing object Or • d(Or, P(Or)) = distance of the routing object from its parent object P(Or)

  11. M-Tree Structure • Leaf node: one entry per database object. • Oj = feature value of the DB object • oid(Oj) = object key • d(Oj, P(Oj)) = distance of Oj from its parent P(Oj)
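
The two entry layouts described on the structure slides can be written down as plain records. This is a sketch under assumptions: the class and field names (`LeafEntry`, `RoutingEntry`, `Node`, `dist_to_parent`, and so on) are invented for illustration, and disk pages are modelled as in-memory lists.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator, List, Union


@dataclass
class LeafEntry:
    obj: Any               # Oj: feature value of the DB object
    oid: int               # oid(Oj): object key
    dist_to_parent: float  # d(Oj, P(Oj))


@dataclass
class RoutingEntry:
    obj: Any               # Or: feature value of the routing object
    subtree: "Node"        # ptr(T(Or)): root of the covering tree
    radius: float          # r(Or): covering radius
    dist_to_parent: float  # d(Or, P(Or))


@dataclass
class Node:
    is_leaf: bool
    entries: List[Union[LeafEntry, RoutingEntry]] = field(default_factory=list)


def leaf_objects(node: Node) -> Iterator[Any]:
    """Yield every indexed object stored below the given node."""
    for entry in node.entries:
        if node.is_leaf:
            yield entry.obj
        else:
            yield from leaf_objects(entry.subtree)


def radius_covers_subtree(entry: RoutingEntry, d) -> bool:
    """Check the covering-radius property: every object in T(Or)
    lies within r(Or) of the routing object Or."""
    return all(d(o, entry.obj) <= entry.radius
               for o in leaf_objects(entry.subtree))
```

Storing d(O, P(O)) in every entry is what later lets a query discard entries with the triangle inequality before computing any new distance.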

  12. Processing Queries • A SAM generally tries to prune the tree for a given query, and the main emphasis is on an efficient pruning method that reduces the number of disk accesses. Once the tree is pruned, however, the distance from the query point Q to each point in the pruned tree still has to be computed. • The M-tree, by contrast, emphasizes both pruning and reducing the number of distance computations, which it achieves by making maximal use of the precomputed distances stored in its nodes.

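  13. Range Query • Given a query point Q and a maximum search distance r(Q), the range query range(Q, r(Q)) returns all objects Oj such that d(Oj, Q) ≤ r(Q). • A routing object Or is of interest only if its region intersects the query region. • How to detect the intersection using precomputed distances? The regions intersect if the distance between Q and Or is at most the sum of the two radii, d(Or, Q) ≤ r(Q) + r(Or). Before computing d(Or, Q), the stored distance d(Or, P(Or)) gives a cheaper test: if |d(P(Or), Q) − d(Or, P(Or))| > r(Q) + r(Or), the regions cannot intersect. (Figure: query region of radius r(Q) around Q, routing object Or with covering radius r(Or), its parent P(Or), and the root)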

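  14. Range Query: Leaf Node • An object Oj in a leaf node is a solution to the range query if it lies within the query radius, i.e. d(Oj, Q) ≤ r(Q). • The relative distance can again be used to find whether the object can lie within the query radius or not. (Figure: query region of radius r(Q) around Q, leaf object Oj, parent P(Or), and the root)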

  15. Algorithm for Range Queries
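
The algorithm on this slide did not survive in the transcript. The sketch below shows one way the range search described on the two previous slides can be written; it is not the presenters' code. The names `Entry`, `Node`, and `range_search` are assumptions, leaf entries are modelled as entries with radius 0 and no subtree, and results are taken to satisfy d(Oj, Q) ≤ r(Q).

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class Entry:
    obj: Any
    dist_to_parent: float             # d(O, P(O)); ignored in the root node
    radius: float = 0.0               # r(Or); 0 for leaf entries
    subtree: Optional["Node"] = None  # None marks a leaf entry


@dataclass
class Node:
    entries: List[Entry] = field(default_factory=list)


def range_search(node: Node, q: Any, r_q: float,
                 d: Callable[[Any, Any], float],
                 d_parent_q: Optional[float] = None,
                 out: Optional[list] = None) -> list:
    """Collect all indexed objects O with d(O, q) <= r_q."""
    out = [] if out is None else out
    for e in node.entries:
        # Step 1: test with stored distances only. By the triangle
        # inequality, |d(P, Q) - d(O, P)| is a lower bound on d(O, Q).
        if (d_parent_q is not None
                and abs(d_parent_q - e.dist_to_parent) > r_q + e.radius):
            continue  # pruned without computing d(O, Q)
        # Step 2: compute the real distance and test for intersection.
        dist = d(e.obj, q)
        if dist <= r_q + e.radius:
            if e.subtree is None:
                out.append(e.obj)  # leaf entry: part of the answer
            else:
                range_search(e.subtree, q, r_q, d, dist, out)
    return out
```

Each recursive call passes d(Or, Q) down as `d_parent_q`, so entries in the child node can be discarded in step 1 using only the d(O, P(O)) values stored with them.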

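  16. K Nearest Neighbors Queries • Given a query point Q and an integer k ≥ 1, the k-NN query NN(Q, k) returns the k indexed objects that have the shortest distances to Q. (Figure: minimum and maximum bounds on the distance from Q to the region of Or, with covering radius r(Or), parent P(Or), and the root)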
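
The "Min Bound" and "Max Bound" labels in the figure are read here as the usual bounds on the distance from Q to any object in the sub-tree T(Or): at least max(d(Or, Q) − r(Or), 0) and at most d(Or, Q) + r(Or). The sketch below computes them; the function names and the parameter `d_k` (the distance of the current k-th nearest candidate) are assumptions made for illustration.

```python
from typing import Tuple


def subtree_distance_bounds(d_or_q: float, r_or: float) -> Tuple[float, float]:
    """Lower and upper bounds on d(Oj, Q) for every Oj in T(Or),
    given d(Or, Q) and the covering radius r(Or)."""
    d_min = max(d_or_q - r_or, 0.0)  # Q may lie inside the region
    d_max = d_or_q + r_or
    return d_min, d_max


def can_prune(d_min: float, d_k: float) -> bool:
    """A sub-tree can be skipped when even its closest possible object
    is farther away than the current k-th nearest candidate."""
    return d_min > d_k
```

A k-NN search would typically visit sub-trees in increasing order of the lower bound and shrink `d_k` as better candidates are found.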

  17. SPLIT MANAGEMENT • The M-tree grows in a bottom-up fashion • Overflow of a node N is managed by splitting it into two nodes, N and N’ (newly created) • PARTITIONING: the entries are distributed among N and N’ • PROMOTE: two entries are promoted as routing objects and moved to the parent level

  18. SPLIT MANAGEMENT • If the split node is a leaf, the covering radius of a promoted object, say Op1, is set to r(Op1) = max{d(Oj, Op1) | Oj ∈ N1}, where N1 is the set of entries assigned to Op1 • whereas if the overflow occurs in an internal node, r(Op1) = max{d(Or, Op1) + r(Or) | Or ∈ N1}
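
The two covering-radius formulas translate directly into code. In this minimal sketch the function names are hypothetical, and the routing entries of an internal node are assumed to be given as (Or, r(Or)) pairs.

```python
from typing import Any, Callable, Iterable, Tuple


def leaf_covering_radius(promoted: Any, objects: Iterable[Any],
                         d: Callable[[Any, Any], float]) -> float:
    """r(Op1) = max{ d(Oj, Op1) | Oj in N1 } for a split leaf."""
    return max(d(o, promoted) for o in objects)


def internal_covering_radius(promoted: Any,
                             routing: Iterable[Tuple[Any, float]],
                             d: Callable[[Any, Any], float]) -> float:
    """r(Op1) = max{ d(Or, Op1) + r(Or) | Or in N1 } for a split
    internal node; adding r(Or) keeps every sub-tree fully covered."""
    return max(d(o, promoted) + r for o, r in routing)
```

The internal-node formula is an upper bound obtained from the triangle inequality, so the resulting radius can be larger than strictly necessary, but it never requires descending into the sub-trees.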

  19. SPLIT POLICIES • A specific implementation of the Promote and Partition methods defines a split policy • An ideal split policy should promote two objects and partition the other objects so that the obtained regions have - minimum volume - minimum overlap • How is this different from a SAM?

  20. PROMOTE: Choosing Routing Objects • m_RAD: minimizes the sum of the two radii • mM_RAD: minimizes the maximum of the two radii • M_LB_DIST: maximum lower bound on distance • RANDOM • SAMPLING
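
As an illustration of one promotion policy, the sketch below implements mM_RAD by exhaustive search over all pairs of objects in an overflowing leaf. It is a simplified reading of the policy, not the presenters' code: the function names are hypothetical, and the entries are assumed to be distributed by assigning each object to its nearest promoted object.

```python
from itertools import combinations
from typing import Any, Callable, List, Tuple


def partition_nearest(objects: List[Any], p1: Any, p2: Any,
                      d: Callable[[Any, Any], float]):
    """Assign each object to the nearer of the two promoted objects."""
    n1, n2 = [], []
    for o in objects:
        (n1 if d(o, p1) <= d(o, p2) else n2).append(o)
    return n1, n2


def promote_mM_RAD(objects: List[Any],
                   d: Callable[[Any, Any], float]) -> Tuple[float, Any, Any]:
    """Try every pair of objects and keep the pair whose larger
    covering radius is smallest. Returns (score, Op1, Op2)."""
    best = None
    for p1, p2 in combinations(objects, 2):
        n1, n2 = partition_nearest(objects, p1, p2, d)
        r1 = max((d(o, p1) for o in n1), default=0.0)
        r2 = max((d(o, p2) for o in n2), default=0.0)
        score = max(r1, r2)
        if best is None or score < best[0]:
            best = (score, p1, p2)
    return best
```

Replacing `max(r1, r2)` with `r1 + r2` gives the corresponding sketch of m_RAD. Both need a number of distance computations that grows quadratically with the node size, which is why cheaper policies such as RANDOM and SAMPLING are listed as alternatives.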

  21. PARTITIONING: Distribution of Entries • Generalized hyperplane (unbalanced split; why?): assign each object Oj ∈ N to the nearest routing object: • if d(Oj, Op1) ≤ d(Oj, Op2) then • assign Oj to N1, else • assign Oj to N2. • Balanced: compute d(Oj, Op1) and d(Oj, Op2) for all Oj ∈ N. Repeat until N is empty: • Assign to N1 the nearest neighbor of Op1 in N and remove it from N; • Assign to N2 the nearest neighbor of Op2 in N and remove it from N.
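
The two partitioning strategies can be compared side by side. This is a minimal sketch with hypothetical function names; ties in the balanced strategy are broken by list order, which the slide does not specify.

```python
from typing import Any, Callable, List, Tuple


def partition_hyperplane(objects: List[Any], p1: Any, p2: Any,
                         d: Callable[[Any, Any], float]
                         ) -> Tuple[List[Any], List[Any]]:
    """Generalized hyperplane: each object goes to the nearer routing
    object, so the two groups may end up with very different sizes."""
    n1, n2 = [], []
    for o in objects:
        (n1 if d(o, p1) <= d(o, p2) else n2).append(o)
    return n1, n2


def partition_balanced(objects: List[Any], p1: Any, p2: Any,
                       d: Callable[[Any, Any], float]
                       ) -> Tuple[List[Any], List[Any]]:
    """Balanced: alternately hand out the remaining nearest neighbor
    of Op1 and of Op2 until no objects are left."""
    remaining = list(objects)
    n1, n2 = [], []
    while remaining:
        nearest = min(remaining, key=lambda o: d(o, p1))
        remaining.remove(nearest)
        n1.append(nearest)
        if remaining:
            nearest = min(remaining, key=lambda o: d(o, p2))
            remaining.remove(nearest)
            n2.append(nearest)
    return n1, n2


dist = lambda a, b: abs(a - b)
data = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0]
print(partition_hyperplane(data, 1.0, 10.0, dist))
print(partition_balanced(data, 1.0, 10.0, dist))
```

On this data the hyperplane strategy yields groups of four and two objects, while the balanced strategy yields three and three but has to place 3.0 with the routing object 10.0, which enlarges that covering radius from 1 to 7.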

  22. Experimental Results • Assumed a constant node size • Tested all split policies • Results • The balanced partition method was shown to add significant overhead and to increase the I/O cost • The fastest split policy was observed to be RANDOM and the slowest m_RAD • In terms of average volume covered per page (quality of tree construction), M_LB_DIST proved effective

  23. Experimental Results(2)

  24. I/O cost

  25. Avg Volume per page

  26. I/O cost

  27. I/O cost for M-Tree & R*-Tree

  28. Thanks
