
A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces


Presentation Transcript


  1. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces • Presented By • Umang Shah • Koushik

  2. Introduction • A sequential scan outperforms the indexing methods once the dimensionality exceeds roughly 10. • Every clustering or data-space partitioning method fails to handle high-dimensional vector spaces (HDVSs) beyond a certain dimensionality. • The VA-File is proposed to perform the inevitable sequential scan more efficiently; its advantage grows with the number of dimensions.

  3. Assumptions and Notation • Assumption 1 – Data and Metric • Data points lie in the unit hypercube [0,1]^d • Distances are measured by an Lp metric • Assumption 2 – Uniformity and Independence • Data and query points are uniformly distributed • Dimensions are independent
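Stated compactly (a sketch of the paper's formal setting; the rendering below is mine, with Ω denoting the data space):

```latex
% Assumption 1: data space and metric
\Omega = [0,1]^d, \qquad
d_p(x,y) = \Big( \sum_{i=1}^{d} |x_i - y_i|^p \Big)^{1/p}
% Assumption 2: every coordinate of data and query points is drawn
% independently and uniformly from [0,1].
```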

  4. NN, NN-distance, NN-sphere
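In symbols, the three notions from this slide (the standard definitions; DB denotes the set of data points and d(·,·) the metric from Assumption 1):

```latex
\mathrm{NN}(q) = \operatorname*{arg\,min}_{p \in \mathrm{DB}} d(p,q)
\qquad
\mathrm{nndist}(q) = \min_{p \in \mathrm{DB}} d(p,q)
\qquad
\mathrm{nnsphere}(q) = \{\, x \in \Omega : d(x,q) \le \mathrm{nndist}(q) \,\}
```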

  5. Probability and Volume Computations
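Under the uniformity assumption, the probability that a data point falls into a region R equals R's volume within Ω; for L2 spheres the standard ball-volume formula applies (reproduced here as a sketch):

```latex
P[\,p \in R\,] = \mathrm{Vol}(R \cap \Omega),
\qquad
\mathrm{Vol}\big(\mathrm{sp}^d(r)\big) = \frac{\sqrt{\pi^d}}{\Gamma(d/2+1)}\; r^d
```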

  6. The Difficulties of High Dimensionality • Number of partitions. • Data space is sparsely populated • Spherical range queries • Exponentially growing DB size • Expected NN-Distance.

  7. Number of partitions • Splitting each dimension once yields 2^d partitions. • Assume N = 10^6 points. • For d = 100, there are 2^100 ≈ 10^30 partitions. • Almost all partitions are therefore empty.
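A quick back-of-the-envelope check of these numbers (plain Python, exact integer arithmetic):

```python
# Splitting each of d dimensions once yields 2**d partitions.
d, n_points = 100, 10**6

partitions = 2**d
print(f"partitions: {partitions:.3e}")                  # ~1.3e+30
# At most n_points partitions can be non-empty:
print(f"fraction possibly non-empty: {n_points / partitions:.3e}")
```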

  8. Data space is sparsely populated • 0.95^100 ≈ 0.0059 • At d = 100, even a hypercube of side 0.95 covers only 0.59% of the data space.
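The same effect seen from the other direction: the side length a hypercube needs in order to cover even a small fraction of the space approaches 1 as d grows (a small illustration, not from the slides):

```python
d = 100

# Volume covered by a hypercube of side 0.95:
print(f"0.95**{d} = {0.95**d:.4f}")     # 0.0059 -> only 0.59% of the space

# Side length a hypercube needs to cover a fraction v of the unit cube:
for v in (0.01, 0.50, 0.99):
    print(f"cover {v:4.2f} of the space: side = {v ** (1/d):.4f}")
```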

  9. Spherical range queries • The largest spherical query that fits completely inside the data space has radius 0.5.
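The point behind the slide: even that largest inscribed sphere has vanishing volume as d grows (a sketch, using the ball-volume formula above):

```latex
\mathrm{Vol}\big(\mathrm{sp}^d(0.5)\big)
  = \frac{\sqrt{\pi^d}}{\Gamma(d/2+1)} \left(\frac{1}{2}\right)^{d}
  \;\longrightarrow\; 0 \quad (d \to \infty)
```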

  10. Exponentially growing DB size • How large must the database be for at least one point to fall into the largest possible sphere? The required N grows exponentially with d.
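A one-line derivation under the uniformity assumption: requiring at least one point in the largest sphere in expectation forces N to grow exponentially in d:

```latex
E[\#\,\text{points in } \mathrm{sp}^d(0.5)]
  = N \cdot \mathrm{Vol}\big(\mathrm{sp}^d(0.5)\big) \ge 1
\quad\Longrightarrow\quad
N \ge \frac{\Gamma(d/2+1)}{\sqrt{\pi^d}}\, 2^{d}
```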

  11. Expected NN-Distance • The NN-distance grows steadily with d.
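Under the uniformity and independence assumptions this can be made precise via the distribution of nndist (a sketch of the standard derivation; sp^d(q,r) is the sphere of radius r around q, and √d bounds any distance in the unit cube):

```latex
P[\,\mathrm{nndist}(q) \le r\,]
  = 1 - \big(1 - \mathrm{Vol}(\mathrm{sp}^d(q,r) \cap \Omega)\big)^{N},
\qquad
E[\mathrm{nndist}]
  = \int_{0}^{\sqrt{d}} \big(1 - \mathrm{Vol}(\mathrm{sp}^d(q,r) \cap \Omega)\big)^{N}\, dr
```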

  12. General Cost Model • P_visit,i – the probability that the i-th block is visited. • If we assume m objects per block, there are ⌈N/m⌉ blocks. • Expected number of blocks visited: M_visit (see below). • Is M_visit > 20% of all blocks?
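Put as a formula (notation as reconstructed from the slide):

```latex
M_{\mathrm{visit}} = \sum_{i=1}^{\lceil N/m \rceil} P_{\mathrm{visit},i}
```

The 20% question matters because sequential I/O reads blocks several times faster than random I/O; once roughly a fifth of all blocks must be touched, a plain scan of the whole file is already the cheaper plan.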

  13. Space-Partitioning Methods • Space consumption grows as 2^d, so splits are performed in d' < d dimensions only. • d' is independent of d. • E[nndist] increases with d. • When E[nndist] is greater than l_max, the entire database is accessed.

  14. Data-Partitioning Methods • Rectangular MBRs • R*-tree, X-tree, SR-tree • Spherical MBRs • TV-tree, M-tree, SR-tree • General partitioning and clustering schemes

  15. Rectangular MBRs

  16. Spherical MBRs

  17. General Partitioning and Clustering Schemes • Assumptions • A cluster is characterized by a geometrical form (MBR) that covers all cluster points • Each cluster contains at least 2 points • The MBR of a cluster is convex.

  18. Vector Approximation File • Basic idea: a technique specially designed for similarity search • Object approximation • Vector data compression

  19. Notations

  20. Lower bound, upper bound
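For a query q and a point p whose approximation places coordinate j into the cell [m_j, M_j), the per-dimension interval distances combine into a lower and an upper bound on the true distance (a sketch; m_j and M_j denote the cell's boundary marks):

```latex
l(q,p) = \Big( \sum_{j=1}^{d} \max(m_j - q_j,\; q_j - M_j,\; 0)^p \Big)^{1/p}
\;\le\; d_p(q,p) \;\le\;
\Big( \sum_{j=1}^{d} \max(q_j - m_j,\; M_j - q_j)^p \Big)^{1/p} = u(q,p)
```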

  21. How it is done • The data space is divided into 2^b rectangular cells • The cells are arranged in the form of a grid • The entire approximation file is scanned at query time

  22. Compression Vector • For each dimension i, a small number of bits b[i] is assigned. • The sum of the b[i] is b. • The data space is divided into 2^b hyper-rectangles. • Each data point is approximated by the bit string of the cell it falls into. • Only the cell boundaries along each dimension need to be stored.

  23. Compression Vector • Normally the number of bits chosen per dimension varies from 4 to 8 • Typically b_i = l and b = d · l, with l = 4…8
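A minimal sketch of how such an approximation file can be built and used to derive the bounds from slide 20, assuming a uniform grid with l bits per dimension (function and variable names are mine, not from the paper):

```python
import numpy as np

def build_va_file(data, l=4):
    """Quantize each coordinate into 2**l uniform cells; return cell indices."""
    return np.minimum((data * 2**l).astype(np.int64), 2**l - 1)

def bounds(query, cells, l=4):
    """Per-point lower/upper bounds on the L2 distance, from cell boundaries."""
    lo_edge = cells / 2**l          # lower boundary of each cell, per dimension
    hi_edge = (cells + 1) / 2**l    # upper boundary
    # Lower bound: distance from the query coordinate to the cell interval
    # (zero if the query coordinate lies inside the interval).
    per_dim_lo = np.maximum.reduce([lo_edge - query, query - hi_edge,
                                    np.zeros_like(lo_edge)])
    # Upper bound: distance to the farther cell boundary.
    per_dim_hi = np.maximum(query - lo_edge, hi_edge - query)
    return (np.sqrt((per_dim_lo**2).sum(axis=1)),
            np.sqrt((per_dim_hi**2).sum(axis=1)))

rng = np.random.default_rng(0)
data = rng.random((1000, 20))       # 1000 points, d = 20
query = rng.random(20)
lb, ub = bounds(query, build_va_file(data), l=4)
```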

  24. Example:

  25. Two probabilities associated with the VA-File

  26. Filtering Step • Simple search algorithm • An array of k candidate elements is maintained in sorted order • The approximation file is searched sequentially • If an element's lower bound is smaller than the upper bound of the k-th candidate, the actual distance is calculated and the array is updated
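A sketch of this filter, assuming the pruning threshold is the k-th smallest exact distance found so far (a max-heap stands in for the sorted array; lb, data, and query are the arrays from the previous sketch):

```python
import heapq
import numpy as np

def va_ssa(query, data, lb, k=10):
    """Scan the file once; fetch a vector only if its lower bound could still
    improve on the current k-th best exact distance."""
    best = []                                   # max-heap of (-distance, index)
    for i in range(len(data)):
        if len(best) == k and lb[i] >= -best[0][0]:
            continue                            # pruned by the lower bound
        d = float(np.linalg.norm(query - data[i]))
        heapq.heappush(best, (-d, i))
        if len(best) > k:
            heapq.heappop(best)                 # drop the worst candidate
    return sorted((-nd, i) for nd, i in best)   # (distance, index), ascending
```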

  27. Filtering Step • Near-optimal search algorithm • Done in two steps, while scanning through the approximation file • Step 1 – Maintain δ, the k-th smallest upper bound encountered so far • If a new element's lower bound is greater than δ, discard it

  28. Filtering Step • Step 2 – The elements remaining after Step 1 are collected • They are visited in increasing order of lower bound, until the next lower bound is greater than or equal to the k-th smallest exact distance found so far
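The two-phase variant, sketched with the same arrays (δ from Step 1 is the k-th smallest upper bound seen so far; an illustration under those assumptions, not the paper's pseudo-code verbatim):

```python
import heapq
import numpy as np

def va_noa(query, data, lb, ub, k=10):
    # Step 1: filter with delta = k-th smallest upper bound seen so far.
    heap, candidates = [], []                  # max-heap of the k smallest ubs
    for i in range(len(data)):
        delta = -heap[0] if len(heap) == k else float("inf")
        if lb[i] > delta:
            continue                           # provably not among the k nearest
        candidates.append(i)
        heapq.heappush(heap, -ub[i])
        if len(heap) > k:
            heapq.heappop(heap)
    # Step 2: visit candidates in increasing lower-bound order; stop once the
    # next lower bound cannot beat the k-th smallest exact distance found.
    best = []                                  # max-heap of (-distance, index)
    for i in sorted(candidates, key=lambda j: lb[j]):
        if len(best) == k and lb[i] >= -best[0][0]:
            break
        d = float(np.linalg.norm(query - data[i]))
        heapq.heappush(best, (-d, i))
        if len(best) > k:
            heapq.heappop(best)
    return sorted((-nd, i) for nd, i in best)
```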

  29. Performance • [Two performance graphs from the paper omitted]

  30. Performance

  31. Conclusion • All approaches to nearest-neighbor search in HDVSs ultimately become linear at high dimensionality. • The VA-File method can outperform any other method known to the authors.
