Multi Feature Indexing Network (MUFIN): Similarity Search Platform for Many Applications

Pavel Zezula, Faculty of Informatics, Masaryk University, Brno. Outline of the talk: why similarity; principles of metric similarity searching; the MUFIN approach; demo applications; future directions.


Presentation Transcript


  1. Multi Feature Indexing Network (MUFIN): Similarity Search Platform for Many Applications
     Pavel Zezula, Faculty of Informatics, Masaryk University, Brno
     MUFIN: Multi Feature Indexing Network

  2. Outline of the talk
     • Why similarity
     • Principles of metric similarity searching
     • The MUFIN approach
     • Demo applications
     • Future directions

  3. Real-Life Motivation: The social psychology view
     • Any event in the history of an organism is, in a sense, unique.
     • Recognition, learning, and judgment presuppose an ability to categorize stimuli and classify situations by similarity.
     • Similarity (proximity, resemblance, communality, representativeness, psychological distance, etc.) is fundamental to theories of perception, learning, judgment, etc.

  4. Contemporary Networked Media: The digital data view
     • Almost everything that we see, read, hear, write, measure, or observe can be digital.
     • Users autonomously contribute to the production of global media, and the growth is exponential.
     • Sites like Flickr, YouTube, and Facebook host user-contributed content for a variety of events.
     • The elements of networked media are related by numerous multi-faceted links of similarity.

  5. Examples with Similarity
     • Does the computer disk of a suspected criminal contain illegal multimedia material?
     • What are the stocks with similar price histories?
     • Which companies advertise their logos in the live TV broadcast of a football match?
     • Is the situation on the web getting close to any of the network attacks that resulted in significant damage in the past?

  6. Challenge
     • Networked media is getting close to the human “fact-bases”
       • the gap between physical and digital has blurred
     • Similarity data management is needed to connect, search, filter, merge, relate, rank, cluster, classify, identify, or categorize objects across various collections.
     WHY? It is similarity that is revealing in the world.

  7. Limitations: Data Types
     We have
     • Attributes
       • Numbers, strings, etc.
     • Text (text-based)
       • Documents, annotations
     We need
     • Multimedia
       • Image, video, audio
     • Security
       • Biometrics
     • Medicine
       • EKG, EEG, EMG, EMR, CT, etc.
     • Scientific data
       • Biology, chemistry, physics, life sciences, economics
     • Others
       • Motion, emotion, events, etc.

  8. Limitations: Models of Similarity
     We have
     • Simple geometric models, typically vector spaces
     We need
     • More complex models
       • Non-metric models
       • Asymmetric similarity
       • Subjective similarity
       • Context-aware similarity
       • Complex similarity
       • Etc.

  9. Limitations: Queries
     We have
     • Simple queries
       • Nearest neighbor
       • Range
     We need
     • More query types
       • Reverse NN, distinct NN, similarity join
     • Other similarity-based operations
       • Filtering, classification, event detection, clustering, etc.
     • Similarity algebra
       • May become the basis of a “Similarity Data Management System”

  10. Limitations: Implementation Strategies
      We have
      • Centralized or parallel processing
      We need
      • Scalable and distributed architectures
        • MapReduce-like approaches
        • P2P architectures
        • Cloud computing
        • Self-organized architectures
        • Etc.

  11. Search Strategy Evolution
      Scalability
      • data volume: exponential
      • number of users (queries)
      • variety of data types
      • multi-lingual, multi-feature, multi-modal queries
      Determinism
      • exact match ► similarity
      • precise ► approximate
      • same answer ► good answer; recommendation
      • fixed query ► personalized; context aware
      • fixed infrastructure ► dynamic mapping; mobile devices
      [Figure: centralized, parallel, distributed, peer-to-peer, and self-organized architectures placed between "well established" and "cutting-edge research", on a grade scale from low to high]

  12. Similarity Data Management System
      [Figure: the Similarity Data Management System surrounded by the terms findability, modelling, infrastructure, retrieval, stimuli, matching, extraction, similarity, effectiveness, efficiency, execution, evaluation, algebra]

  13. Metric Search Grows in Popularity
      • Hanan Samet: Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, 2006
      • P. Zezula, G. Amato, V. Dohnal, and M. Batko: Similarity Search: The Metric Space Approach. Springer, 2006

  14. The MUFIN Approach
      MUFIN: MUlti-Feature Indexing Network
      • Extensibility: metric space
      • Scalability: P2P structure
      • Independence: infrastructure as a service
      [Figure: SEARCH layered over data & queries, index structure, and infrastructure]

  15. Extensibility: Metric Abstraction of Similarity
      • Metric space: M = (D, d)
        • D – domain
        • distance function d(x,y)
      • For all x, y, z ∈ D:
        • d(x,y) ≥ 0 – non-negativity
        • d(x,y) = 0 ⇔ x = y – identity
        • d(x,y) = d(y,x) – symmetry
        • d(x,y) ≤ d(x,z) + d(z,y) – triangle inequality
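The four postulates can be checked empirically on a finite sample. The sketch below is only an illustration: the helper name `check_metric_postulates` and both sample distance functions are assumptions made here, not part of MUFIN, and passing the check on a sample does not prove that a function is a metric.

```python
from itertools import product

def check_metric_postulates(points, d, tol=1e-12):
    """Return the names of the postulates violated on a finite sample."""
    violated = set()
    for x, y, z in product(points, repeat=3):
        if d(x, y) < -tol:
            violated.add("non-negativity")
        if (d(x, y) <= tol) != (x == y):
            violated.add("identity")
        if abs(d(x, y) - d(y, x)) > tol:
            violated.add("symmetry")
        if d(x, y) > d(x, z) + d(z, y) + tol:
            violated.add("triangle inequality")
    return violated

def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def squared_euclidean(x, y):
    # Not a metric: it breaks the triangle inequality.
    return sum((a - b) ** 2 for a, b in zip(x, y))

sample = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
print(check_metric_postulates(sample, euclidean))
print(check_metric_postulates(sample, squared_euclidean))
```

The squared Euclidean distance fails on the collinear points (0,0), (1,0), (2,0), where 4 > 1 + 1, which is why it cannot be plugged into a metric index without modification.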

  16. Examples of Distance Functions
      • Lp Minkowski distance (for vectors)
        • L1 – city-block distance
        • L2 – Euclidean distance
        • L∞ – infinity (maximum) distance
      • Edit distance (for strings)
        • minimal number of insertions, deletions and substitutions
        • d(‘application’, ‘applet’) = 6
      • Jaccard’s coefficient (for sets A, B)
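A minimal Python sketch of the three distances listed on this slide. The function names are chosen here for illustration. The Jaccard formula is missing from the transcript, so the usual form 1 − |A ∩ B| / |A ∪ B| is assumed.

```python
def minkowski(x, y, p):
    """L_p distance between two equal-length vectors; p may be float('inf')."""
    if p == float("inf"):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def edit_distance(s, t):
    """Minimal number of insertions, deletions and substitutions (Levenshtein)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # substitute or match
        prev = curr
    return prev[-1]

def jaccard_distance(a, b):
    """Assumed form: 1 - |A ∩ B| / |A ∪ B|; two empty sets are at distance 0."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

print(edit_distance("application", "applet"))  # the slide's example pair
```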

  17. Examples of Distance Functions
      • Mahalanobis distance
        • for vectors with correlated dimensions
      • Hausdorff distance
        • for sets with elements related by another distance
      • Earth mover's distance
        • primarily for histograms (sets of weighted features)
      • and many others

  18. Similarity Search Problem
      • For X ⊆ D in metric space M, pre-process X so that the similarity queries are executed efficiently.
      No total ordering exists!

  19. Similarity Queries
      • Range query
      • Nearest neighbor query
      • Similarity join
      • Combined queries
      • Complex queries

  20. Similarity Range Query
      • range query
        • R(q,r) = { x ∈ X | d(q,x) ≤ r }
      … all museums up to 2 km from my hotel …
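The definition of R(q,r) translates directly into a linear scan. This is only the naive baseline that an index is meant to beat; the function name and the sample coordinates are illustrative assumptions.

```python
def range_query(X, d, q, r):
    """R(q, r): every object of X within distance r of the query object q."""
    return [x for x in X if d(q, x) <= r]

def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

# Hypothetical museum positions in km, with the hotel at the origin.
museums = [(0.5, 0.5), (1.0, 1.0), (3.0, 4.0), (6.0, 8.0)]
print(range_query(museums, euclidean, (0.0, 0.0), 2.0))
```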

  21. Nearest Neighbor Query
      • the nearest neighbor query
        • NN(q) = x
        • x ∈ X, ∀y ∈ X: d(q,x) ≤ d(q,y)
      • k-nearest neighbor query
        • k-NN(q,k) = A
        • A ⊆ X, |A| = k
        • ∀x ∈ A, y ∈ X − A: d(q,x) ≤ d(q,y)
      … five closest museums to my hotel (k = 5) …
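A brute-force k-NN evaluation of the definition above, sketched with the standard-library `heapq.nsmallest`. The function name and data are illustrative. When several objects are equally distant, any valid answer set A satisfies the definition, and this sketch simply keeps the ones that come first in X.

```python
import heapq

def knn_query(X, d, q, k):
    """k-NN(q, k): the k objects of X closest to q, nearest first."""
    return heapq.nsmallest(k, X, key=lambda x: d(q, x))

def abs_diff(x, y):
    return abs(x - y)

positions = [1, 5, 9, 12, 20]
print(knn_query(positions, abs_diff, 10, 2))
```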

  22. Similarity Join Queries
      • similarity join of two data sets
      • similarity self join ⇔ X = Y
      … pairs of hotels and museums which are a five-minute walk apart …
      [Figure: join threshold μ]
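The join formula itself did not survive in the transcript. The sketch below assumes the usual definition, all pairs (x, y) ∈ X × Y with d(x, y) ≤ μ, and evaluates it with a nested loop. Function names and data are illustrative.

```python
def similarity_join(X, Y, d, mu):
    """Assumed definition: all pairs (x, y) from X x Y with d(x, y) <= mu."""
    return [(x, y) for x in X for y in Y if d(x, y) <= mu]

def similarity_self_join(X, d, mu):
    """Self join (X = Y), reporting each unordered pair of distinct positions once."""
    return [(X[i], X[j])
            for i in range(len(X))
            for j in range(i + 1, len(X))
            if d(X[i], X[j]) <= mu]

def abs_diff(x, y):
    return abs(x - y)

hotels = [0, 10]
museums = [1, 4, 12]
print(similarity_join(hotels, museums, abs_diff, 2))
```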

  23. Combined Queries
      • Range + Nearest neighbors
      • Nearest neighbor + similarity joins
      • by analogy

  24. Complex Queries
      • Find the best matches of circular shape objects with red color
      • The best match for circular shape or red color need not be the best match combined
      • A0 algorithm
      • Threshold algorithm
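A compact sketch of the threshold algorithm mentioned on this slide, under simplifying assumptions made here: every object has a grade in [0, 1] for every feature, higher grades are better, and the per-feature grades are combined with a monotone function (`min` by default). The data layout and names are illustrative.

```python
import heapq

def threshold_algorithm(grades, k, aggregate=min):
    """grades: {feature: {object: grade}}; returns the top-k (object, score) pairs."""
    tables = list(grades.values())
    # One list per feature, sorted by grade, best first (sorted access).
    lists = [sorted(t.items(), key=lambda kv: -kv[1]) for t in tables]
    seen = {}
    for depth in range(len(lists[0])):
        last = []
        for lst in lists:
            obj, grade = lst[depth]
            last.append(grade)
            if obj not in seen:
                # Random access: fetch the object's grade in every feature.
                seen[obj] = aggregate(t[obj] for t in tables)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        # No unseen object can score above the aggregate of the last grades read.
        threshold = aggregate(last)
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])

grades = {
    "circular shape": {"a": 0.9, "b": 0.8, "c": 0.3},
    "red color":      {"a": 0.2, "b": 0.7, "c": 0.95},
}
print(threshold_algorithm(grades, 1))
```

In this toy data, object a is the best match for shape and c the best for color, yet b is the best combined match, which is the point the slide makes.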

  25. Partitioning Principles
      • Given a set X ⊆ D in M = (D,d), basic partitioning principles have been defined:
        • Ball partitioning
        • Generalized hyper-plane partitioning
        • Excluded middle partitioning
        • Clustering

  26. Ball Partitioning
      • Inner set: { x ∈ X | d(p,x) ≤ dm }
      • Outer set: { x ∈ X | d(p,x) > dm }
      [Figure: ball of radius dm around pivot p]
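Ball partitioning in a few lines of Python. The slide does not say how dm is chosen. Taking the median distance to the pivot, as done here when no dm is given, is a common choice for balanced halves and is an assumption of this sketch.

```python
from statistics import median

def ball_partition(X, d, p, dm=None):
    """Split X into (inner, outer) around pivot p; dm defaults to the median distance."""
    if dm is None:
        dm = median(d(p, x) for x in X)
    inner = [x for x in X if d(p, x) <= dm]
    outer = [x for x in X if d(p, x) > dm]
    return inner, outer, dm

def abs_diff(x, y):
    return abs(x - y)

print(ball_partition([1, 2, 3, 4, 10], abs_diff, 0))
```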

  27. Generalized Hyper-plane
      • { x ∈ X | d(p1,x) ≤ d(p2,x) }
      • { x ∈ X | d(p1,x) > d(p2,x) }
      [Figure: pivots p1 and p2 separated by the generalized hyper-plane]
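The same idea for generalized hyper-plane partitioning, as an illustrative sketch: each object goes to the side of its closer pivot, and ties go to p1, exactly as in the two set definitions above.

```python
def hyperplane_partition(X, d, p1, p2):
    """Split X by which of the two pivots is closer; ties go to p1."""
    closer_to_p1 = [x for x in X if d(p1, x) <= d(p2, x)]
    closer_to_p2 = [x for x in X if d(p1, x) > d(p2, x)]
    return closer_to_p1, closer_to_p2

def abs_diff(x, y):
    return abs(x - y)

print(hyperplane_partition([0, 2, 5, 7, 10], abs_diff, 0, 10))
```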

  28. Excluded Middle Partitioning
      • Inner set: { x ∈ X | d(p,x) ≤ dm − ρ }
      • Outer set: { x ∈ X | d(p,x) > dm + ρ }
      • Excluded set: otherwise
      [Figure: ball of radius dm around pivot p with an excluded ring of width 2ρ]
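Excluded middle partitioning adds a third set for objects that lie within ρ of the ball boundary. A minimal sketch, with `rho` standing for the ρ of the slide and all names chosen here for illustration:

```python
def excluded_middle_partition(X, d, p, dm, rho):
    """Split X into inner, outer and excluded sets around pivot p."""
    inner, outer, excluded = [], [], []
    for x in X:
        dist = d(p, x)
        if dist <= dm - rho:
            inner.append(x)
        elif dist > dm + rho:
            outer.append(x)
        else:
            excluded.append(x)  # within the ring of width 2 * rho
    return inner, outer, excluded

def abs_diff(x, y):
    return abs(x - y)

print(excluded_middle_partition([1, 2, 3, 4, 5, 6, 7], abs_diff, 0, 4, 1))
```

Any inner object and any outer object are more than 2ρ apart, so a range query with radius at most ρ can never need both of these sets.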

  29. Clustering
      • Cluster data into sets
        • bounded by a ball region
        • { x ∈ X | d(pi,x) ≤ ric }

  30. Scalability: Peer-to-Peer Indexing
      • Local search: M-tree, D-Index, M-Index
      • Native metric techniques: GHT*, VPT*
      • Transformation techniques: M-CAN, M-Chord

  31. The M-tree [Ciaccia, Patella, Zezula, VLDB 1997]
      1) Paged organization
      2) Dynamic
      3) Suitable for arbitrary metric spaces
      4) I/O and CPU optimization: computing d can be time-consuming

  32. The M-tree Idea
      • Depending on the metric, the “shape” of index regions changes
      [Figure: index regions A–F drawn for the metrics L2 (Euclidean), L1 (city-block), L∞ (max-metric), weighted-Euclidean, and quadratic form]

  33. M-tree: Example
      [Figure: M-tree over objects o1–o11; routing entries carry a covering radius and a distance to parent, leaf entries carry a distance to parent]

  34. M-tree family
      • Bulk loading
      • Slim-tree
      • Multi-way insertion
      • PM-tree
      • M2-tree
      • etc.

  35. D-Index [Dohnal, Gennaro, Zezula, MTA 2002]
      [Figure: two-level D-Index with 4 separable buckets at the first level, 2 separable buckets at the second level, and an exclusion bucket of the whole structure]

  36. D-index: Insertion

  37. D-index: Range Search
      [Figure: range queries with query object q and radius r]

  38. Implementation Postulates of Distributed Indexes
      • dynamism – nodes can be added and removed
      • no hot-spots – no centralized nodes, no flooding by messages (transactions)
      • update independence – network update at one site does not require an immediate change propagation to all the other sites

  39. Distributed Similarity Search Structures
      • Native metric structures:
        • GHT* (Generalized Hyperplane Tree)
        • VPT* (Vantage Point Tree)
      • Transformation approaches:
        • M-CAN (Metric Content Addressable Network)
        • M-Chord (Metric Chord)

  40. GHT* Address Search Tree
      • Based on the Generalized Hyperplane Tree [Uhl91]
      • two pivots for binary partitioning
      [Figure: pivots p1–p6 in the metric space and the corresponding binary tree]

  41. GHT* Address Search Tree
      • Inner node
        • two pivots (reference objects)
      • Leaf node
        • BID pointer to a bucket if data stored on the current peer
        • NNID pointer to a peer if data stored on a different peer
      [Figure: address search tree with pivots p1–p6 and leaves BID1, BID2, BID3, NNID2 (Peer 2)]

  42. GHT* Address Search Tree
      [Figure: address search trees held by Peer 1, Peer 2, and Peer 3]

  43. GHT* Range Query
      • Range query R(q,r)
        • traverse peer’s own AST
        • search buckets for all BIDs found
        • forward query to all NNIDs found
      [Figure: query ball (q, r) among pivots p1–p6, and the AST traversal reaching BID1, BID2, BID3 locally and NNID2 on Peer 2]
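The traversal step can be sketched as a recursive walk over the address search tree. The class names below are hypothetical, and the pruning rule is the standard generalized hyper-plane condition derived from the triangle inequality, which the slide does not spell out: the left subtree is visited when d(p1,q) − r ≤ d(p2,q) + r, the right one when d(p2,q) − r ≤ d(p1,q) + r.

```python
from dataclasses import dataclass

@dataclass
class Bucket:          # leaf: BID pointer to a bucket on the current peer
    bid: int

@dataclass
class Remote:          # leaf: NNID pointer to a different peer
    nnid: int

@dataclass
class Inner:           # inner node: two pivots and two subtrees
    p1: object
    p2: object
    left: object       # objects closer to p1
    right: object      # objects closer to p2

def traverse_ast(node, d, q, r):
    """Return (BIDs to search locally, NNIDs to forward the query to)."""
    if isinstance(node, Bucket):
        return {node.bid}, set()
    if isinstance(node, Remote):
        return set(), {node.nnid}
    bids, nnids = set(), set()
    d1, d2 = d(node.p1, q), d(node.p2, q)
    if d1 - r <= d2 + r:   # the query ball may reach objects closer to p1
        b, n = traverse_ast(node.left, d, q, r)
        bids |= b
        nnids |= n
    if d2 - r <= d1 + r:   # the query ball may reach objects closer to p2
        b, n = traverse_ast(node.right, d, q, r)
        bids |= b
        nnids |= n
    return bids, nnids

def abs_diff(x, y):
    return abs(x - y)

# Toy one-dimensional tree: BID1 holds values near 0, BID2 values near 8,
# and values near 14 live on peer 2.
ast = Inner(0, 10, Bucket(1), Inner(8, 14, Bucket(2), Remote(2)))
print(traverse_ast(ast, abs_diff, 12, 2))
```

The peer would then scan the returned buckets itself and send R(q,r) to each returned peer, which repeats the same procedure on its own tree.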

  44. AST: Logarithmic Replication
      • Full AST on every peer is space consuming
        • replication of pivots grows in a linear way
      • Store only a part of the AST:
        • all paths to local buckets
      • Deleted sub-trees:
        • replaced by NNID of the leftmost peer
      [Figure: full AST with pivots p1–p14 and leaves BID1, NNID2–NNID8, next to the pruned version]

  45. AST: Logarithmic Replication (cont.)
      • Resulting tree
        • replication of pivots grows in a logarithmic way
      [Figure: pruned AST with pivots p1, p2, p3, p4, p7, p8 and leaves BID1, NNID2, NNID3, NNID5]

  46. VPT* Structure
      • Similar to the GHT*; ball partitioning is used for the AST
      • Based on the Vantage Point Tree [Yia93]
        • inner nodes have one pivot and a radius
        • different traversing conditions
      [Figure: pivots p1, p2, p3 with radii r1, r2, r3 and the corresponding tree]

  47. M-Chord: The Metric Chord
      • Transform metric space to one-dimensional domain
        • Use M-Index, a generalized version of the iDistance
      • Divide the domain into intervals
        • assign each interval to a peer
      • Use the Chord P2P protocol for navigation
        • The Skip graphs distributed protocol can be used, alternatively

  48. M-Chord: Indexing the Distance
      • iDistance – indexing technique for vector domains
        • cluster analysis = centers = reference points pi
        • assign iDistance keys to objects
        • range query R(q,r): identify intervals of interest
      • Generalization to metric spaces
        • select pivots
        • then partition: Voronoi-style
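The key assignment can be sketched as follows, assuming the usual iDistance form key(x) = i·c + d(pi, x), where pi is the pivot closest to x (Voronoi-style partitioning) and c is a constant larger than any distance that can occur. Function names and the value of c are illustrative. By the triangle inequality, every answer to R(q,r) assigned to pivot pi has its key inside [i·c + d(pi,q) − r, i·c + d(pi,q) + r], which gives the intervals of interest.

```python
def idistance_key(x, pivots, d, c):
    """Key of x: index of its closest pivot times c, plus the distance to that pivot."""
    i = min(range(len(pivots)), key=lambda j: d(pivots[j], x))
    return i * c + d(pivots[i], x)

def intervals_of_interest(q, r, pivots, d, c):
    """One key interval per pivot that must be inspected for R(q, r)."""
    intervals = []
    for i, p in enumerate(pivots):
        dq = d(p, q)
        intervals.append((i * c + max(dq - r, 0), i * c + dq + r))
    return intervals

def abs_diff(x, y):
    return abs(x - y)

pivots = [0, 10]
print(idistance_key(8, pivots, abs_diff, 100))
print(intervals_of_interest(7, 2, pivots, abs_diff, 100))
```

A full implementation would also clip each interval by the largest distance actually stored for its pivot, so that pivots far from the query can be skipped entirely.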

  49. M-Chord: Chord Protocol
      • Peer-to-Peer navigation protocol
      • Peers are responsible for intervals of keys
      • O(log n) hops to localize a node storing a key, for n peers
      • M-Chord
        • set the iDistance domain
        • make it uniform: function h
        • Use Chord on this domain

  50. M-Chord: Range Query
      • Node Nq initiates the search
      • Determine intervals
        • generalized iDistance
      • Forward requests to peers on intervals
      • Search in the nodes
        • using local organization
      • Merge the received partial answers
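The steps of this slide can be simulated in a single process. Everything below is a simplified sketch under assumptions made here: peers are plain objects that each own a half-open key interval, Chord routing is replaced by a direct loop over the peers, the uniforming function h is omitted, and c is assumed to exceed every value of d(pi,q) + r so that the intervals of different pivots do not overlap.

```python
def idistance_key(x, pivots, d, c):
    i = min(range(len(pivots)), key=lambda j: d(pivots[j], x))
    return i * c + d(pivots[i], x)

class Peer:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi   # responsible for keys in [lo, hi)
        self.store = []             # local (key, object) pairs

    def local_search(self, lo, hi, q, r, d):
        # Stand-in for the peer's local index: filter by key, then verify by distance.
        return [x for k, x in self.store if lo <= k <= hi and d(q, x) <= r]

def build_network(X, pivots, d, c, boundaries):
    peers = [Peer(lo, hi) for lo, hi in zip(boundaries, boundaries[1:])]
    for x in X:
        k = idistance_key(x, pivots, d, c)
        next(p for p in peers if p.lo <= k < p.hi).store.append((k, x))
    return peers

def mchord_range_query(peers, q, r, pivots, d, c):
    answer = []
    for i, p in enumerate(pivots):
        dq = d(p, q)
        lo, hi = i * c + max(dq - r, 0), i * c + dq + r   # interval of interest
        for peer in peers:
            if peer.lo <= hi and lo < peer.hi:            # peer overlaps the interval
                answer.extend(peer.local_search(lo, hi, q, r, d))  # merge partial answers
    return answer

def abs_diff(x, y):
    return abs(x - y)

data = [1, 2, 8, 9, 15]
pivots = [0, 10]
peers = build_network(data, pivots, abs_diff, 100, [0, 50, 103, 200])
print(sorted(mchord_range_query(peers, 7, 2, pivots, abs_diff, 100)))
```

In the real system the initiating node would reach the responsible peers through Chord lookups rather than by iterating over a global list, which is the part this simulation leaves out.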
