
A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces


Presentation Transcript


  1. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces • Presented By • Umang Shah • Koushik

  2. Introduction • A sequential scan outperforms the indexing methods once the dimensionality exceeds roughly 10. • Every clustering or data-space partitioning method fails to handle high-dimensional vector spaces (HDVSs) beyond a certain dimensionality. • The VA-File is proposed to perform the inevitable sequential scan more efficiently; its advantage grows with the number of dimensions.

  3. Assumptions and Notation • Assumption 1 – Data and Metric • Data points lie in the unit hypercube [0,1]^d • Distances are measured by an Lp metric • Assumption 2 – Uniformity and Independence • Data and query points are uniformly distributed • Dimensions are independent
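Stated compactly (a sketch of the paper's formal setting; the rendering below is mine, with Ω denoting the data space):

```latex
% Assumption 1: data space and metric
\Omega = [0,1]^d, \qquad
d_p(x,y) = \Big( \sum_{i=1}^{d} |x_i - y_i|^p \Big)^{1/p}
% Assumption 2: every coordinate of data and query points is drawn
% independently and uniformly from [0,1].
```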

  4. NN, NN-distance, NN-sphere
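In symbols, the three notions from this slide (the standard definitions; DB denotes the set of data points and d(·,·) the metric from Assumption 1):

```latex
\mathrm{NN}(q) = \operatorname*{arg\,min}_{p \in \mathrm{DB}} d(p,q)
\qquad
\mathrm{nndist}(q) = \min_{p \in \mathrm{DB}} d(p,q)
\qquad
\mathrm{nnsphere}(q) = \{\, x \in \Omega : d(x,q) \le \mathrm{nndist}(q) \,\}
```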

  5. Probability and Volume Computations
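Under the uniformity assumption, the probability that a data point falls into a region R equals R's volume within Ω; for L2 spheres the standard ball-volume formula applies (reproduced here as a sketch):

```latex
P[\,p \in R\,] = \mathrm{Vol}(R \cap \Omega),
\qquad
\mathrm{Vol}\big(\mathrm{sp}^d(r)\big) = \frac{\sqrt{\pi^d}}{\Gamma(d/2+1)}\; r^d
```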

  6. The Difficulties of High Dimensionality • Number of partitions. • Data space is sparsely populated • Spherical range queries • Exponentially growing DB size • Expected NN-Distance.

  7. Number of partitions • Splitting each dimension once yields 2^d partitions. • Assume N = 10^6 points. • For d = 100, there are 2^100 ≈ 10^30 partitions. • Almost all partitions are therefore empty.
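A quick back-of-the-envelope check of these numbers (plain Python, exact integer arithmetic):

```python
# Splitting each of d dimensions once yields 2**d partitions.
d, n_points = 100, 10**6

partitions = 2**d
print(f"partitions: {partitions:.3e}")                  # ~1.3e+30
# At most n_points partitions can be non-empty:
print(f"fraction possibly non-empty: {n_points / partitions:.3e}")
```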

  8. Data space is sparsely populated • 0.95^100 ≈ 0.0059 • At d = 100, even a hypercube of side 0.95 covers only 0.59% of the data space.
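The same effect seen from the other direction: the side length a hypercube needs in order to cover even a small fraction of the space approaches 1 as d grows (a small illustration, not from the slides):

```python
d = 100

# Volume covered by a hypercube of side 0.95:
print(f"0.95**{d} = {0.95**d:.4f}")     # 0.0059 -> only 0.59% of the space

# Side length a hypercube needs to cover a fraction v of the unit cube:
for v in (0.01, 0.50, 0.99):
    print(f"cover {v:4.2f} of the space: side = {v ** (1/d):.4f}")
```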

  9. Spherical range queries • The largest spherical query that fits completely inside the data space has radius 0.5.
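The point behind the slide: even that largest inscribed sphere has vanishing volume as d grows (a sketch, using the ball-volume formula above):

```latex
\mathrm{Vol}\big(\mathrm{sp}^d(0.5)\big)
  = \frac{\sqrt{\pi^d}}{\Gamma(d/2+1)} \left(\frac{1}{2}\right)^{d}
  \;\longrightarrow\; 0 \quad (d \to \infty)
```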

  10. Exponentially growing DB size • How large must the database be for at least one point to fall into the largest possible sphere? The required N grows exponentially with d.
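A one-line derivation under the uniformity assumption: requiring at least one point in the largest sphere in expectation forces N to grow exponentially in d:

```latex
E[\#\,\text{points in } \mathrm{sp}^d(0.5)]
  = N \cdot \mathrm{Vol}\big(\mathrm{sp}^d(0.5)\big) \ge 1
\quad\Longrightarrow\quad
N \ge \frac{\Gamma(d/2+1)}{\sqrt{\pi^d}}\, 2^{d}
```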

  11. Expected NN-Distance • The NN-distance grows steadily with d.
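Under the uniformity and independence assumptions this can be made precise via the distribution of nndist (a sketch of the standard derivation; sp^d(q,r) is the sphere of radius r around q, and √d bounds any distance in the unit cube):

```latex
P[\,\mathrm{nndist}(q) \le r\,]
  = 1 - \big(1 - \mathrm{Vol}(\mathrm{sp}^d(q,r) \cap \Omega)\big)^{N},
\qquad
E[\mathrm{nndist}]
  = \int_{0}^{\sqrt{d}} \big(1 - \mathrm{Vol}(\mathrm{sp}^d(q,r) \cap \Omega)\big)^{N}\, dr
```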

  12. General Cost Model • P_visit,i – the probability that the i-th block is visited. • If we assume m objects per block, there are ⌈N/m⌉ blocks. • Expected number of blocks visited: M_visit (see below). • Is M_visit > 20% of all blocks?
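Put as a formula (notation as reconstructed from the slide):

```latex
M_{\mathrm{visit}} = \sum_{i=1}^{\lceil N/m \rceil} P_{\mathrm{visit},i}
```

The 20% question matters because sequential I/O reads blocks several times faster than random I/O; once roughly a fifth of all blocks must be touched, a plain scan of the whole file is already the cheaper plan.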

  13. Space-Partitioning Methods • Space consumption grows as 2^d, so splits are performed in d' < d dimensions only. • d' is independent of d. • E[nndist] increases with d. • When E[nndist] is greater than l_max, the entire database is accessed.

  14. Data-Partitioning Methods • Rectangular MBRs • R*-tree, X-tree, SR-tree • Spherical MBRs • TV-tree, M-tree, SR-tree • General partitioning and clustering schemes

  15. Rectangular MBRs

  16. Spherical MBRs

  17. General Partitioning and Clustering Schemes • Assumptions • A cluster is characterized by a geometrical form (MBR) that covers all cluster points • Each cluster contains at least 2 points • The MBR of a cluster is convex.

  18. Vector Approximation File • Basic idea: a technique specially designed for similarity search • Object approximation • Vector data compression

  19. Notations

  20. Lower bound, upper bound
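For a query q and a point p whose approximation places coordinate j into the cell [m_j, M_j), the per-dimension interval distances combine into a lower and an upper bound on the true distance (a sketch; m_j and M_j denote the cell's boundary marks):

```latex
l(q,p) = \Big( \sum_{j=1}^{d} \max(m_j - q_j,\; q_j - M_j,\; 0)^p \Big)^{1/p}
\;\le\; d_p(q,p) \;\le\;
\Big( \sum_{j=1}^{d} \max(q_j - m_j,\; M_j - q_j)^p \Big)^{1/p} = u(q,p)
```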

  21. How it is done • The data space is divided into 2^b rectangular cells • The cells are arranged in the form of a grid • The entire approximation file is scanned at query time

  22. Compression Vector • For each dimension i, a small number of bits b[i] is assigned. • The sum of the b[i] is b. • The data space is divided into 2^b hyper-rectangles. • Each data point is approximated by the bit string of the cell it falls into. • Only the cell boundaries along each dimension need to be stored.

  23. Compression Vector • Normally the number of bits chosen per dimension varies from 4 to 8 • Typically b_i = l and b = d · l, with l = 4…8
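A minimal sketch of how such an approximation file can be built and used to derive the bounds from slide 20, assuming a uniform grid with l bits per dimension (function and variable names are mine, not from the paper):

```python
import numpy as np

def build_va_file(data, l=4):
    """Quantize each coordinate into 2**l uniform cells; return cell indices."""
    return np.minimum((data * 2**l).astype(np.int64), 2**l - 1)

def bounds(query, cells, l=4):
    """Per-point lower/upper bounds on the L2 distance, from cell boundaries."""
    lo_edge = cells / 2**l          # lower boundary of each cell, per dimension
    hi_edge = (cells + 1) / 2**l    # upper boundary
    # Lower bound: distance from the query coordinate to the cell interval
    # (zero if the query coordinate lies inside the interval).
    per_dim_lo = np.maximum.reduce([lo_edge - query, query - hi_edge,
                                    np.zeros_like(lo_edge)])
    # Upper bound: distance to the farther cell boundary.
    per_dim_hi = np.maximum(query - lo_edge, hi_edge - query)
    return (np.sqrt((per_dim_lo**2).sum(axis=1)),
            np.sqrt((per_dim_hi**2).sum(axis=1)))

rng = np.random.default_rng(0)
data = rng.random((1000, 20))       # 1000 points, d = 20
query = rng.random(20)
lb, ub = bounds(query, build_va_file(data), l=4)
```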

  24. Example:

  25. Two probabilities associated with the VA-File

  26. Filtering Step • Simple search algorithm • An array of k candidate elements is maintained in sorted order • The approximation file is searched sequentially • If an element's lower bound is smaller than the upper bound of the k-th candidate, the actual distance is calculated and the array is updated
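A sketch of this filter, assuming the pruning threshold is the k-th smallest exact distance found so far (a max-heap stands in for the sorted array; lb, data, and query are the arrays from the previous sketch):

```python
import heapq
import numpy as np

def va_ssa(query, data, lb, k=10):
    """Scan the file once; fetch a vector only if its lower bound could still
    improve on the current k-th best exact distance."""
    best = []                                   # max-heap of (-distance, index)
    for i in range(len(data)):
        if len(best) == k and lb[i] >= -best[0][0]:
            continue                            # pruned by the lower bound
        d = float(np.linalg.norm(query - data[i]))
        heapq.heappush(best, (-d, i))
        if len(best) > k:
            heapq.heappop(best)                 # drop the worst candidate
    return sorted((-nd, i) for nd, i in best)   # (distance, index), ascending
```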

  27. Filtering Step • Near-optimal search algorithm • Done in two steps, while scanning through the approximation file • Step 1 – Maintain δ, the k-th smallest upper bound encountered so far • If a new element's lower bound is greater than δ, discard it

  28. Filtering Step • Step 2 – The elements remaining after Step 1 are collected • They are visited in increasing order of lower bound, until the next lower bound is greater than or equal to the k-th smallest exact distance found so far
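The two-phase variant, sketched with the same arrays (δ from Step 1 is the k-th smallest upper bound seen so far; an illustration under those assumptions, not the paper's pseudo-code verbatim):

```python
import heapq
import numpy as np

def va_noa(query, data, lb, ub, k=10):
    # Step 1: filter with delta = k-th smallest upper bound seen so far.
    heap, candidates = [], []                  # max-heap of the k smallest ubs
    for i in range(len(data)):
        delta = -heap[0] if len(heap) == k else float("inf")
        if lb[i] > delta:
            continue                           # provably not among the k nearest
        candidates.append(i)
        heapq.heappush(heap, -ub[i])
        if len(heap) > k:
            heapq.heappop(heap)
    # Step 2: visit candidates in increasing lower-bound order; stop once the
    # next lower bound cannot beat the k-th smallest exact distance found.
    best = []                                  # max-heap of (-distance, index)
    for i in sorted(candidates, key=lambda j: lb[j]):
        if len(best) == k and lb[i] >= -best[0][0]:
            break
        d = float(np.linalg.norm(query - data[i]))
        heapq.heappush(best, (-d, i))
        if len(best) > k:
            heapq.heappop(best)
    return sorted((-nd, i) for nd, i in best)
```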

  29. Performance • [Two performance graphs from the paper omitted]

  30. Performance

  31. Conclusion • All approaches to nearest-neighbor search in HDVSs ultimately become linear at high dimensionality. • The VA-File method can outperform any other method known to the authors.
