
On Hyper-plane Partition of Distance-Based Indexing



Presentation Transcript


  1. On Hyper-plane Partition of Distance-Based Indexing Rui Mao National High Performance Computing Center at Shenzhen College of Computer Science and Software Engineering Shenzhen University, China 02/22/2011

  2. Outline • Similarity query and applications • Distance-based indexing • The Complete General Hyper-plane Tree • VPT: possibly the optimal hyper-plane • Conclusion and future work

  3. 1. Similarity Query & Applications Given • A database of n data records: S = {x1, x2, …, xn} • A similarity (distance) measure d(x,y) = the distance between data records x and y • A query object q Query types • Range query R(q,r): all records x with d(q,x) <= r • KNN query (k-nearest neighbor): the k records closest to q, e.g. the top 10 results in Google Maps

  4. Example 1 • Find all students with score in [75, 85]: SELECT name FROM student WHERE ABS(score-80)<=5;
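Viewed through the lens of this talk, the same query is a metric range query R(q=80, r=5) with d(x, y) = |x - y|. A minimal Python sketch (the student data are hypothetical, not from the slides):

```python
# A metric view of the SQL query above: range query R(q=80, r=5) with
# d(x, y) = |x - y|. The student data below are hypothetical.
def range_query(data, q, r, d):
    """Return all (name, value) pairs whose value is within distance r of q."""
    return [(name, s) for name, s in data.items() if d(q, s) <= r]

d = lambda x, y: abs(x - y)                      # 1-D Euclidean norm
scores = {"Alice": 78, "Bob": 91, "Carol": 83}   # name -> score
print(range_query(scores, q=80, r=5, d=d))       # [('Alice', 78), ('Carol', 83)]
```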

  5. Example 2: Gas station near UT

  6. Molecular Biological Information System (MoBIoS) http://www.cs.utexas.edu/~mobios

  7. Image retrieval [CIT05]

  8. Conserved primer pair [ISMB04] Given: • Arabidopsis genome (120M) • Rice genome (537M) Goal: • determine a large number of paired, conserved DNA primers that can be used as primer pairs for PCR Similarity: • Hamming distance of 18-mers
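A minimal sketch of the similarity measure named here, the Hamming distance between two equal-length k-mers (k = 18 on the slide); the example sequences are made up:

```python
# Hamming distance between two k-mers of equal length (k = 18 in this application).
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    assert len(a) == len(b)
    return sum(c1 != c2 for c1, c2 in zip(a, b))

print(hamming("ACGTACGTACGTACGTAC", "ACGTACGTACGTACGTAA"))  # 1
```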

  9. Mass-spectra coarse filter [Bioinformatics06] Given: • A mass-spectra database • A query mass spectrum (a high-dimensional vector) Goal: • A coarse filter that retrieves a small subset of the database as candidates for fine filtering Similarity: • Semi-cosine distance

  10. Protein sequence homology [BIBE06] Given • A database of sequences • A query sequence Goal: • Local alignment Similarity: • Global alignment of 6-mers with mPAM matrix (weighted edit distance) Methodology • Break database and query into k-mers • Similarity query of k-mers • Chain the results.
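A minimal sketch of the first methodology step, breaking a sequence into overlapping k-mers (k = 6, matching the 6-mers on the slide); the example sequence is made up, and the mPAM-weighted alignment used as the actual k-mer distance is not shown:

```python
# Decompose a sequence into overlapping k-mers before querying the index.
def kmers(seq: str, k: int = 6):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmers("MKVLATN", 6))  # ['MKVLAT', 'KVLATN']
```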

  11. 2. Distance-based Indexing Indexing: • Goal: fast data lookup • Minimize number of distance calculations • Ideal case: Log or even constant time • Worst case: Sequential scan of database • Methodology: Partition and pruning

  12. Category: data type & similarity • Data type: one-dimensional, R; similarity measure: Euclidean norm (absolute value of the difference); index: one-dimensional indexing, e.g. B-tree • Data type: multi-dimensional, Rn; similarity measure: Euclidean norm; index: multi-dimensional indexing, e.g. kd-tree • Data type: other; similarity measure: other; index: ? example: ?

  13. Metric Space: a pair M = (D, d), where • D is a set of points • d is a [metric] distance function satisfying: • d(x,y) = d(y,x) (symmetry) • d(x,y) >= 0, and d(x,y) = 0 iff x = y (non-negativity and identity) • d(x,z) <= d(x,y) + d(y,z) (triangle inequality)
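A minimal Python sketch (not from the talk) that spells the three axioms out as executable checks over a finite sample of points:

```python
import itertools

# Check the metric axioms on a finite sample of points for a distance function d.
def check_metric(points, d, eps=1e-9):
    for x, y, z in itertools.product(points, repeat=3):
        assert abs(d(x, y) - d(y, x)) <= eps          # symmetry
        assert d(x, y) >= -eps                        # non-negativity
        assert (d(x, y) <= eps) == (x == y)           # identity of indiscernibles
        assert d(x, z) <= d(x, y) + d(y, z) + eps     # triangle inequality

check_metric([1.0, 4.0, 7.5], d=lambda x, y: abs(x - y))  # 1-D Euclidean norm
print("metric axioms hold on the sample")
```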

  14. How does it work? Range query R(Snoopy, 2) • d(Michael, Linc) = 1, d(Linc, Snoopy) = 100, so by the triangle inequality 99 <= d(Michael, Snoopy) <= 101, and Michael can be pruned without computing d(Michael, Snoopy) Advantages • Generality • One-dimensional data • Multi-dimensional data with Euclidean norm • Any metric space • A uniform programming model • the distance oracle is given • One index mechanism for most data types Disadvantages • Not fast enough?
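The pruning step behind the Snoopy example, as a small Python sketch (my own illustration): given d(q, p) and a stored d(p, x), the triangle inequality bounds d(q, x) and may rule x out without a distance computation.

```python
# Triangle-inequality pruning for a range query R(q, r) using a reference object p.
def can_prune(d_q_p, d_p_x, r):
    """True if x cannot be in R(q, r): the lower bound |d(q,p) - d(p,x)| already exceeds r."""
    return abs(d_q_p - d_p_x) > r

# Slide example: d(Michael, Linc) = 1, d(Linc, Snoopy) = 100, query R(Snoopy, 2).
# 99 <= d(Michael, Snoopy) <= 101, so Michael is pruned without a distance call.
print(can_prune(d_q_p=100, d_p_x=1, r=2))  # True
```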

  15. Data partition: three families • Hyper-plane methods • GHT [Uhlmann 1991] • GNAT [Brin 1995] • SA-tree [Navarro 1999] • Vantage point methods • BKT [Burkhard and Keller 1973] • VPT [Uhlmann 1991, Yianilos 1993] • MVPT [Bozkaya et al. 1997] • Bounding sphere methods • BST [Kalantari and McDonald 1983] • M-tree [Ciaccia et al. 1997] • Slim-tree [Traina et al. 2000]

  16. Hyper-plane methods [Uhlmann 1991] • Choose two centers C1 and C2 • Partition the data by the generalized hyper-plane L between them: objects closer to C1 fall on one side of L, objects closer to C2 on the other
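A minimal sketch of one hyper-plane (GHT-style) partition step; the toy 1-D data and centers are made up:

```python
# One GHT partition step: send each object to the side of its nearer center.
def ght_partition(data, c1, c2, d):
    left  = [x for x in data if d(x, c1) <= d(x, c2)]   # closer to c1
    right = [x for x in data if d(x, c1) >  d(x, c2)]   # closer to c2
    return left, right

d = lambda x, y: abs(x - y)
print(ght_partition([1, 3, 6, 9, 12], c1=2, c2=10, d=d))  # ([1, 3, 6], [9, 12])
```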

  17. Vantage Point Tree (VPT) [Uhlmann 1991 & Yianilos 1993] • Choose vantage points • Partition the data: at the root, objects with d(VP1, x) <= R1 go inside the sphere, objects with d(VP1, x) > R1 go outside; each child is partitioned recursively with its own vantage point and radius (VP21, R21 and VP22, R22) Searching R(q, r): • Case 1: if d(VP1, q) > R1 + r, search only outside the sphere • Case 2: if d(VP1, q) < R1 - r, search only inside the sphere • Case 3 (bad case): the query object is close to the partition boundary, so descend both children
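A minimal sketch of a binary VPT with the three search cases above; pivot selection (first element) and the split radius (the median) are simplifying assumptions, not the talk's exact choices:

```python
import statistics

def build_vpt(data, d, leaf_size=4):
    """Binary VPT: split on the median distance to a vantage point."""
    if len(data) <= leaf_size:
        return {"leaf": data}
    vp, rest = data[0], data[1:]
    dists = [d(vp, x) for x in rest]
    R = statistics.median(dists)
    inside  = [x for x, dx in zip(rest, dists) if dx <= R]
    outside = [x for x, dx in zip(rest, dists) if dx >  R]
    return {"vp": vp, "R": R,
            "in": build_vpt(inside, d, leaf_size),
            "out": build_vpt(outside, d, leaf_size)}

def search(node, q, r, d, out):
    if "leaf" in node:
        out.extend(x for x in node["leaf"] if d(q, x) <= r)
        return
    dq = d(node["vp"], q)
    if dq <= r:                        # the vantage point itself may be an answer
        out.append(node["vp"])
    if dq > node["R"] + r:             # Case 1: search only outside the sphere
        search(node["out"], q, r, d, out)
    elif dq < node["R"] - r:           # Case 2: search only inside the sphere
        search(node["in"], q, r, d, out)
    else:                              # Case 3: near the boundary, descend both
        search(node["in"], q, r, d, out)
        search(node["out"], q, r, d, out)

d = lambda x, y: abs(x - y)
tree = build_vpt(list(range(0, 100, 3)), d)
result = []
search(tree, q=50, r=4, d=d, out=result)
print(sorted(result))                  # [48, 51, 54]
```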

  18. Bounding sphere methods [Ciaccia et al. 1997] • Choose centers C1, C2, C3, … • Partition the data into bounding spheres, each center Ci with a covering radius R(Ci)
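A minimal sketch of the corresponding pruning rule for bounding spheres (my own illustration of the M-tree-style test, not code from the talk):

```python
# A subtree with center c and covering radius R(c) can be skipped for R(q, r)
# when d(q, c) > R(c) + r: no object inside the sphere can then be within r of q.
def prune_sphere(d_q_c, R_c, r):
    return d_q_c > R_c + r

print(prune_sphere(d_q_c=10.0, R_c=3.0, r=2.0))  # True: skip this subtree
```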

  19. Difficulties and problems • No coordinates • Mathematical tools not directly applicable • Mostly heuristic • Lack of theoretical analysis (addressed by the pivot space model; SISAP 2010 Best Paper, with Dr. Miranker) • 3 families of indices • Not unified • Hard to compare, analyze and predict (focus of this talk, with Dr. Miranker)

  20. General Methodology • metric space → Rk • multi-dimensional indexing → query cube • direct evaluation of the cube

  21. P S The pivot space model Mapping: M  Rk :x  Pivot space: The image of S in Rk

  22. Example of pivot space: VPT [figure: a VPT partition (vantage point VP1, radius R1) shown in the pivot space with axes d(p1, x) and d(p2, x)]

  23. 3. The Complete General Hyper-plane Tree (CGHT) [figures: GHT in the metric space; GHT in the pivot space, where the boundary is the line d1 = d2 and the partition is by d1 - d2; MVPT in the pivot space; CGHT in the metric space; CGHT in the pivot space]
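A minimal sketch of a single "partition by d1 - d2" step as the CGHT figures suggest; choosing the split value as the median (to balance the two sides) is my assumption, not necessarily the talk's exact rule:

```python
import statistics

# Partition by the key d(x, p1) - d(x, p2) in the pivot space; the median split
# value balances the two sides (an assumption for this sketch).
def cght_partition(data, p1, p2, d):
    keys = [d(x, p1) - d(x, p2) for x in data]
    t = statistics.median(keys)
    left  = [x for x, k in zip(data, keys) if k <= t]
    right = [x for x, k in zip(data, keys) if k >  t]
    return left, right, t

d = lambda x, y: abs(x - y)
print(cght_partition([1, 3, 6, 9, 12, 15], p1=0, p2=16, d=d))  # ([1, 3, 6], [9, 12, 15], -1.0)
```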

  24. r-neighborhood • Nr(L), the r-neighborhood of a partition boundary L, is the region around L in the pivot space such that, if a query object q falls into it, R(q, r) could have results on both sides of L. • Assuming q has the same distribution as the database, |Nr(L)| dominates query performance. • |Nr(L)| is determined by the neighborhood's width and the data density around L.

  25. Width of the r-neighborhood (pivot space with axes d(p1, x) and d(p2, x)) • (a) Special case, vantage-point boundary L: x = μ. Nr(L): |x - μ| <= r, so the width is 2r. • (b) Special case, hyper-plane boundary L: y = x. Nr(L): |y - x| <= 2r, bounded by the lines y = x + 2r and y = x - 2r, so the perpendicular width is 4r/√2 = 2√2·r ≈ 2.83r > 2r. • Min width of r-neighborhood: the MVPT (vantage-point) partition has the minimal width of the r-neighborhood.
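A short check of the two widths quoted above, using only the stated boundaries:

```python
import math

# The distance between the parallel lines y = x + 2r and y = x - 2r is
# |2r - (-2r)| / sqrt(1^2 + (-1)^2), versus width 2r for the strip |x - mu| <= r.
r = 1.0
width_vpt = 2 * r                                   # strip |x - mu| <= r
width_ght = abs(2 * r - (-2 * r)) / math.sqrt(2)    # = 2*sqrt(2)*r ≈ 2.83r
print(width_vpt, width_ght, width_ght > width_vpt)  # 2.0 2.828... True
```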

  26. |Nr(L)|: analytical • 2-d normal distribution: N(0, 1, 0, 1, -ρ), 0 <= ρ <= 1 • |NGHT(r)| ∝ PGHT(r) = P(|x - y| <= 2r), where (x, y) ~ N(0, 1, 0, 1, -ρ) • |NVPT(r)| ∝ PVPT(r) = P(|x| <= r), where x ~ N(0, 1)
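A sketch (my own, assuming the boundaries from the previous slide) that evaluates both probabilities with SciPy and cross-checks them by Monte Carlo; note that x - y is normal with variance 2 + 2ρ when the correlation is -ρ:

```python
import numpy as np
from scipy.stats import norm

# Under N(0,1,0,1,-rho): VPT neighborhood fraction = P(|x| <= r),
# GHT neighborhood fraction = P(|x - y| <= 2r) with Var(x - y) = 2 + 2*rho.
def p_vpt(r):
    return 2 * norm.cdf(r) - 1

def p_ght(r, rho):
    return 2 * norm.cdf(2 * r / np.sqrt(2 + 2 * rho)) - 1

# Monte Carlo cross-check of the closed forms.
rho, r = 0.5, 0.3
cov = [[1, -rho], [-rho, 1]]
xy = np.random.default_rng(0).multivariate_normal([0, 0], cov, 200_000)
print(p_vpt(r), np.mean(np.abs(xy[:, 0]) <= r))                       # VPT: analytical vs empirical
print(p_ght(r, rho), np.mean(np.abs(xy[:, 0] - xy[:, 1]) <= 2 * r))   # GHT: analytical vs empirical
```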

  27. |Nr(L)|: empirical

  28. Dimension rotation might not be helpful! • A counter example

  29. Conclusions and Future work Conclusions • Distance-based indexing is a very general approach • CGHT is an improvement on GHT • All three families are hyper-plane partitions in the pivot space • The VPT partition has the minimal width of the r-neighborhood, and possibly the minimal size as well Future work • Multi-dimensional/statistical methods • Non-linear partition • Applications

  30. Thank you! mao@szu.edu.cn http://nhpcc.szu.edu.cn/mao/eng
