NNH: Improving Performance of Nearest-Neighbor Searches Using Histograms Liang Jin (UC Irvine) Nick Koudas (AT&T) Chen Li (UC Irvine)
Outline • Motivation: NN search • NNH: Proposed histogram structure • Main idea • Utilizing NNH in a search (KNN, join) • Constructing NNH • Incremental maintenance • Experiments
NN (nearest-neighbor) search • KNN: find the k nearest neighbors of a query object q. • NN-join: for each object in the 1st dataset D1, find its k nearest neighbors in the 2nd dataset D2.
Example: image search Query image • Images represented as features (color histogram, texture moments, etc.) • Similarity search using these features • “Find 10 most similar images for the query image”
Other Applications • Web-page search • “Find 100 most similar pages for a given page” • Page represented as word-frequency vector • Similarity: vector distance • GIS: “find the 5 closest cities to Irvine” • CAD, information retrieval, molecular biology, data cleansing, … • Challenges: Efficiency, Scalability
NN Algorithms • Distance measurement: • When objects are points, distance is well defined • Usually Euclidean • Other distances possible • For arbitrary-shaped objects, assume we have a distance function between them • Most algorithms assume a high-dimensional tree structure exists for the datasets.
Example: R-Trees Take 2-d space as an example.
Minimal Bounding Rectangle • An MBR is an n-dimensional rectangle that bounds its corresponding objects. • MBR face property: every face of any MBR contains at least one point of some object
Search process (1-NN for example) • Most algorithms traverse the structure (e.g., R-tree) top down, and follow a branch-and-bound approach • Keep a priority queue of nodes (mbr’s) to be visited • Sorted based on the “minimum distance” between q and each node • Improvement: • Use MINDIST and MINMAXDIST • Reduce the queue size • Avoid unnecessary disk IO’s to access MBR’s Priority queue
Pruning in NN search • 1. Discard mbr1 if MINDIST(q,mbr1) > MINMAXDIST(q,mbr2) • 2. Discard object o if dist(q,o) > MINMAXDIST(q,mbr2) • 3. Discard mbr1 if MINDIST(q,mbr1) > dist(q,o)
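The two bounds used in these pruning rules can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and an MBR given as two corner tuples `lo` and `hi` (names chosen here, not taken from the slides); MINMAXDIST follows the standard definition that relies on the MBR face property.

```python
import math

def mindist(q, lo, hi):
    """Smallest possible distance from point q to any point of the MBR [lo, hi]."""
    total = 0.0
    for x, a, b in zip(q, lo, hi):
        if x < a:
            total += (a - x) ** 2
        elif x > b:
            total += (x - b) ** 2
    return math.sqrt(total)

def minmaxdist(q, lo, hi):
    """Upper bound on the distance from q to its nearest object inside the MBR.

    For each dimension k, take the nearer face along k and the farthest
    corner along every other dimension, then keep the smallest such value.
    """
    near = []  # squared gap to the nearer bound, per dimension
    far = []   # squared gap to the farther bound, per dimension
    for x, a, b in zip(q, lo, hi):
        mid = (a + b) / 2
        near_edge = a if x <= mid else b
        far_edge = a if x >= mid else b
        near.append((x - near_edge) ** 2)
        far.append((x - far_edge) ** 2)
    total_far = sum(far)
    return math.sqrt(min(total_far - far[k] + near[k] for k in range(len(q))))

q = (0.0, 0.0)
lo, hi = (1.0, 1.0), (3.0, 2.0)
lower = mindist(q, lo, hi)      # no object in the MBR can be closer than this
upper = minmaxdist(q, lo, hi)   # some object in the MBR is at least this close
```

Rule 1 then amounts to comparing `mindist` of one MBR against `minmaxdist` of another.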
Problem • Queue size may be large: • Example: 60,000 32-d (image) vectors, 50 NNs • Max queue size: 15K entries • Avg queue size: half (7.5K entries) • If the queue can’t fit in memory, more disk IOs! • Problem is worse for k-NN joins • E.g., 1500 x 1500 join: • Max queue size: 1.7M entries: >= 1GB memory! • 750 seconds to run • Couldn’t scale up to 2000 objects! • Disk thrashing
Our Solution: Nearest-Neighbor Histogram (NNH) • Main idea • Utilizing NNH in a search (KNN, join) • Constructing NNH • Incremental maintenance
NNH: Nearest-Neighbor Histograms • m: # of pivots (p1, p2, …, pm) • For each pivot, the distances of its nearest neighbors: r1, r2, …,
Main idea • Keep a histogram of NN distances of a pre-selected collection of objects (pivots). • They are not part of the database • They give a “big” picture of objects’ locations • Use the histogram to estimate the NN distance of a given query object. • Use these estimated NN distances to do more pruning in an NN search
Structure • Nearest Neighbor Vector of a pivot p: (r1, r2, …, rT), where each ri is the distance from p to its i-th NN • T: length of each vector • Nearest Neighbor Histogram • Collection of m pivots with their NN vectors
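Estimate NN distance for query object • NNH does not give exact NN information for an object • But we can estimate an upper bound qest for the k-NN distance of q • Triangle inequality: for any pivot p, the k-NN distance of q is at most dist(q, p) + H(p, k), where H(p, k) is the distance from p to its k-th NN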
Estimate NN for query object (cont’d) • Apply the triangle inequality to all pivots • Upper bound estimate of NN distance of q: qest = min over all pivots pi of (dist(q, pi) + H(pi, k)) • Complexity: O(m)
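The O(m) estimate can be sketched as below. This is an illustrative sketch, not the authors' code: it assumes Euclidean distance, stores each pivot's NN vector as a sorted Python list, and uses hypothetical names (`estimate_knn_radius`, `knn_radius`) and a toy dataset invented for the example.

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def estimate_knn_radius(q, pivots, nn_vectors, k):
    """Upper bound on the k-NN distance of q: one triangle-inequality bound per pivot."""
    return min(dist(q, p) + radii[k - 1] for p, radii in zip(pivots, nn_vectors))

def knn_radius(q, data, k):
    """Exact k-NN distance by brute force, used here only to check the bound."""
    return sorted(dist(q, o) for o in data)[k - 1]

# Toy data (assumed for illustration): two small groups of points.
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
pivots = [(0.5, 0.5), (5.5, 5.5)]          # pivots are not part of the database
nn_vectors = [sorted(dist(p, o) for o in data)[:3] for p in pivots]  # T = 3

q = (1.0, 1.0)
q_est = estimate_knn_radius(q, pivots, nn_vectors, k=2)
exact = knn_radius(q, data, k=2)
```

The estimate is never smaller than the exact k-NN distance, so pruning with it cannot discard a true neighbor.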
Utilizing estimates in NN search • More pruning: prune an mbr if MINDIST(q, mbr) > qest
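A sketch of this extra pruning step, assuming the test is MINDIST(q, mbr) > qest, Euclidean distance, and MBRs given as `(lo, hi)` corner pairs. The function name `prune_mbrs` and the sample rectangles are hypothetical.

```python
import math

def mindist(q, lo, hi):
    """Smallest possible distance from point q to the MBR [lo, hi]."""
    total = 0.0
    for x, a, b in zip(q, lo, hi):
        if x < a:
            total += (a - x) ** 2
        elif x > b:
            total += (x - b) ** 2
    return math.sqrt(total)

def prune_mbrs(q, mbrs, q_est):
    """Keep only the MBRs that may still contain one of q's k nearest neighbors."""
    return [(lo, hi) for lo, hi in mbrs if mindist(q, lo, hi) <= q_est]

q = (0.0, 0.0)
mbrs = [((1.0, 1.0), (2.0, 2.0)), ((10.0, 10.0), (11.0, 11.0))]
survivors = prune_mbrs(q, mbrs, q_est=3.0)
```

MBRs dropped this way never enter the priority queue, which is where the memory saving comes from.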
Utilizing estimates in NN join • K-NN join: for each object o1 in D1, find its k-nearest neighbors in D2. • Preliminary algorithm by Hjaltason and Samet [HS98] • Traverse two trees top down; keep a queue of pairs
Utilizing estimates in NN join (cont’d) • Construct NNH for D2. • For each object o1 in D1, keep its estimated NN radius o1est using NNH of D2. • Similar to k-NN query, ignore mbr for o1 if MINDIST(o1, mbr) > o1est
Prune MBR pairs (cont’d) • Prune this MBR pair (mbr1, mbr2) if MINDIST(mbr1, mbr2) exceeds the largest estimated NN radius o1est of the objects o1 in mbr1
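A sketch of a pair-pruning test for the join. The exact condition is not recoverable from the slide text, so this assumes one safe variant: a pair is pruned when the minimum distance between the two rectangles exceeds the largest estimated NN radius of any object under mbr1. The names `mindist_mbr`, `can_prune_pair`, and `max_est_radius` are hypothetical.

```python
import math

def mindist_mbr(lo1, hi1, lo2, hi2):
    """Smallest possible distance between any point of MBR 1 and any point of MBR 2."""
    total = 0.0
    for a1, b1, a2, b2 in zip(lo1, hi1, lo2, hi2):
        gap = max(a2 - b1, a1 - b2, 0.0)   # 0 when the intervals overlap
        total += gap ** 2
    return math.sqrt(total)

def can_prune_pair(mbr1, mbr2, max_est_radius):
    """True if no object in mbr2 can be a k-NN of any object in mbr1.

    max_est_radius is assumed to be the largest NNH estimate among the
    objects stored under mbr1.
    """
    return mindist_mbr(*mbr1, *mbr2) > max_est_radius

mbr1 = ((0.0, 0.0), (1.0, 1.0))
mbr2 = ((4.0, 5.0), (6.0, 7.0))
gap = mindist_mbr(*mbr1, *mbr2)
```

Pruned pairs are never pushed onto the join's queue of MBR pairs.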
How to construct an NNH? • If we have selected the m pivots: • Just run KNN queries for them to construct NNH • Time is O(m) KNN queries • Offline • Important: selecting pivots • Size-Constraint NNH Construction • Error-Constraint NNH Construction
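Given the pivots, construction can be sketched as one KNN query per pivot. This sketch answers each query by brute force for brevity (a real system would use the index); `build_nnh`, the toy data, and the choice T = 2 are assumptions made for the example.

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_nnh(pivots, data, T):
    """Return one NN vector per pivot: the sorted distances to its T nearest objects."""
    return [sorted(dist(p, o) for o in data)[:T] for p in pivots]

data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (9.0, 12.0)]
pivots = [(0.0, 0.0), (9.0, 12.0)]
nnh = build_nnh(pivots, data, T=2)
```

Each entry of `nnh` is the vector (r1, …, rT) for the corresponding pivot.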
Size-constraint NNH construction • # of pivots “m” determines • Storage size • Initial construction cost • Incremental-maintenance cost • Choose m “best” pivots
Size-constraint NNH construction • Given m: # of pivots • Assuming: • query objects are from the database D • H(pi,k) doesn’t vary too much • Goal: Find pivots p1, p2, …, pm to minimize object distances to the pivots: • Clustering problem: • Many algorithms available • Use K-means for its simplicity and efficiency
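Choosing m pivots by clustering can be sketched with a plain Lloyd-style k-means. This is a minimal sketch, not the authors' implementation: the function name `kmeans_pivots`, the iteration cap, the fixed seed, and the toy data are all assumptions, and the result depends on the random initial centers.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans_pivots(data, m, iters=20, seed=0):
    """Return m cluster centers of data to use as pivots."""
    rng = random.Random(seed)
    centers = rng.sample(data, m)          # start from m distinct objects
    for _ in range(iters):
        clusters = [[] for _ in range(m)]
        for o in data:
            nearest = min(range(m), key=lambda i: dist(o, centers[i]))
            clusters[nearest].append(o)
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                # Move the center to the mean of its assigned objects.
                mean = tuple(sum(xs) / len(members) for xs in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(center)  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers

data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
pivots = kmeans_pivots(data, m=2)
```

The returned centers are generally not objects of the dataset, which matches the slides' note that pivots are not part of the database.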
Error-constraint NNH construction • Assumptions: • A threshold r is set a priori • Any estimate of the k-NN distance less than r is considered “good” enough. • I.e., a maximum error of r is tolerated for any distance estimate.
Error-constraint NNH construction (cont’d) • Find a set of points S = {p1, p2, …, pm} from the dataset D • For each point pi, its kNN’s are within distance r/2 • Then, for any point q within distance r/2 from pi, we get a distance estimate for the KNN of q of at most r/2 + r/2 = r
Error-constraint NNH construction (cont) • Problem: find points such that • They cover the entire data set with spheres of radius r/2 • The sum of distances of all points in each sphere to its center is minimized • An instance of the “k-center problem” • Efficient 2-approximation algorithm using a single pass over the dataset
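One way to cover a dataset with spheres of radius r/2 in a single pass is a greedy rule: an object becomes a new center only if no existing center already covers it. The sketch below shows that rule only; it is not necessarily the 2-approximation algorithm the slides refer to, and it does not try to minimize the sum of distances. The name `cover_pivots` and the toy data are assumptions.

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cover_pivots(data, radius):
    """Single pass: pick centers so every object lies within radius of some center."""
    centers = []
    for o in data:
        if all(dist(o, c) > radius for c in centers):
            centers.append(o)   # o is uncovered, so it starts a new sphere
    return centers

data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
r = 4.0
centers = cover_pivots(data, radius=r / 2)
```

The number of centers is not fixed in advance here; it falls out of the chosen threshold r.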
Incremental Maintenance • How to update the NNH when inserting or deleting objects? • Need to “shift” each vector: • Associate a valid length Ei to each NN vector.
Insertion • Locate the position j in each NN vector where the distance from the pivot to the new object fits in sorted order, i.e., between r(j-1) and rj
Insertion (cont’d) • If j not found, we don’t need to update this pivot NN vector (why?) • If found: • insert the new radius • shift the vector to the right • increment Ei by 1.
Deletion • Similar to the Insertion • Locate the position j of the deleted object’s distance in each NN vector • If not found, no update for this vector • If found: • remove rj • shift the rest to the left • decrement Ei by 1
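The insertion and deletion steps for one pivot's NN vector can be sketched as below. This is an illustrative sketch under stated assumptions: the class name `NNVector` is hypothetical, the valid length Ei is represented by the length of the Python list rather than a separate counter, and Ei is capped at T by dropping the last entry when an insertion overflows the vector.

```python
import bisect

class NNVector:
    """Sorted NN distances of one pivot; only the stored entries are trusted."""

    def __init__(self, radii, T):
        self.T = T                        # maximum vector length
        self.radii = sorted(radii)[:T]    # len(self.radii) plays the role of Ei

    @property
    def valid_length(self):
        return len(self.radii)

    def insert(self, d):
        """Handle a new object at distance d from the pivot. Returns True if updated."""
        j = bisect.bisect_right(self.radii, d)
        if j == len(self.radii):
            # d lies beyond the last trusted entry: nothing can be said safely.
            return False
        self.radii.insert(j, d)           # shifts later entries to the right
        del self.radii[self.T:]           # drop the overflow entry, if any
        return True

    def delete(self, d):
        """Handle removal of an object at distance d. Returns True if updated."""
        j = bisect.bisect_left(self.radii, d)
        if j == len(self.radii) or self.radii[j] != d:
            return False                  # the object was not among the stored NNs
        del self.radii[j]                 # shifts later entries to the left
        return True

v = NNVector([1.0, 2.0, 4.0], T=3)
v.insert(3.0)     # fits between 2.0 and 4.0; 4.0 falls off the end
v.delete(2.0)     # valid length drops to 2
```

After a deletion the vector is shorter than T, because the next-nearest object beyond the old last entry is unknown without a fresh KNN query.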
Experiments • Dataset: • Corel image database • Contains 60,000 images • Each image represented by a 32-dimensional float vector • Test bed: • PC: 1.5G Athlon, 512MB Mem, 80G HD, Windows 2000. • GNU C++ in CYGWIN
Questions to be answered • Is the pruning using NNH estimates powerful? • KNN queries • NN-join queries • Is it “cheap” to have such a structure? • Storage • Initial construction • Incremental maintenance
Improvement in k-NN search • Run k-means algorithm to generate 400 pivots, and construct the NNH • Perform 10-NN queries on 100 randomly selected query objects. • Queue size as the benchmark for memory usage. • Max queue size • Average queue size
Improvement in k-NN joins • Selected two subsets from the Corel dataset. Each contains 1500 objects. • Unfortunately couldn’t run on the PC due to the large memory requirement • Ran on a SUN Ultra 4 workstation with four 300MHz CPUs and 3GB Memory. • Constructed NNH (400 pivots) for D2.
Cost/Benefit of NNH For 60,000 32-d float vectors. “0” means almost zero.
Conclusion • NNH: efficient, effective approach to improving NN-search performance. • Can be easily embedded into current implementations of NN algorithms. • Can be efficiently constructed and maintained. • Offers substantial performance advantages.
Work conducted in the Flamingo Project on Data Cleansing at UC Irvine