This work addresses the challenges of content-based similarity search in high-dimensional, feature-rich digital data like audio, video, and scientific sensor data, where traditional search tools often fall short. We propose a system architecture that employs sketches—compact data representations that facilitate efficient pairwise distance estimation—allowing for effective similarity searches. Our framework includes mechanisms for data segmentation, feature extraction, indexing, and performance evaluation, ensuring both high-quality results and reasonable speed, paving the way for future advancements in similarity search techniques.
Content-Based Similarity Search Moses Charikar Princeton University Joint work with: Qin Lv, William Josephson, Zhe Wang, Perry Cook, Matthew Hoffman, Kai Li
Motivation • Massive amounts of feature-rich digital data • Audio, video, digital photos, scientific sensor data • Noisy, high-dimensional • Traditional file systems/search tools inadequate • Exact match • Keyword-based search • Annotations • Need content-based similarity search
Motivation • Recent progress of theoretical studies on sketches • Compact data representation for estimation of pairwise similarity/distance • Compact data structures for high-quality and efficient content-based similarity search?
Compact representation • Sketch: complex object → short bit vector (e.g. 0101100110 and 0010110010) • Distance measured by (weighted) ℓ1 distance: d(x,y) = Σi wi·|xi−yi| • Better still, Hamming distance between bit vectors • Distance between sketches estimates distance between objects • Several theoretical constructions of sketches for sets, vectors, earth mover distance (EMD)
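As a rough illustration of the two distance functions named on this slide, the Python sketch below computes a weighted ℓ1 distance and a Hamming distance. The function names and the sample vectors and weights passed to `weighted_l1` are assumptions made for the example; the two 10-bit strings are the example sketches from the slide.

```python
def weighted_l1(x, y, w):
    """d(x, y) = sum_i w_i * |x_i - y_i|."""
    return sum(wi * abs(xi - yi) for xi, yi, wi in zip(x, y, w))


def hamming(a, b):
    """Number of positions where two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same length")
    return sum(p != q for p, q in zip(a, b))


sketch_a = "0101100110"
sketch_b = "0010110010"

print(hamming(sketch_a, sketch_b))                      # 5
print(weighted_l1([1, 2, 3], [2, 0, 3], [1, 0.5, 2]))   # 2.0
```

When sketches are packed into machine words, the Hamming distance reduces to an XOR followed by a bit count, which is one reason it is attractive as the fast distance.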
Outline • Motivation • System architecture • Implementation details • Segmentation & feature extraction • Sketch construction • Filtering • Indexing • Performance evaluation • Conclusions & future work
Similarity Search Engine Architecture • Pre-processing • Query time
Similarity Search Problem • Similarity search: finding objects similar to a query object i.e. containing similar features • Object representation • Distance function d (X, Y) • Nearest neighbor query • K-nearest neighbor (KNN) • Approximate nearest neighbor (ANN)
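A brute-force K-nearest-neighbor query can be sketched in a few lines of Python. This is only a baseline illustration: the names `knn` and `l1` are hypothetical, objects are assumed to be plain numeric tuples, and ℓ1 stands in for whatever distance function d(X, Y) the data type calls for.

```python
import heapq


def l1(x, y):
    """Unweighted l1 distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def knn(query, objects, k, dist=l1):
    """Return the k objects closest to query under dist (full scan)."""
    return heapq.nsmallest(k, objects, key=lambda o: dist(query, o))


objects = [(0, 0), (1, 1), (5, 5), (3, 0)]
print(knn((0, 0), objects, k=2))  # [(0, 0), (1, 1)]
```

An approximate nearest neighbor (ANN) method would relax this by allowing results that are close to, but not guaranteed to be, the true nearest neighbors in exchange for speed.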
Object Representation & Distance Function • Earth Mover Distance (EMD) [Figure: EMD example with numeric values]
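To make EMD concrete, here is a minimal Python sketch for one special case only: two objects with the same number of segments n, each segment carrying weight 1/n. In that case the optimal flow is a one-to-one matching, so a search over permutations suffices for tiny inputs. The general weighted case needs a transportation (linear programming) solver, which is not shown. The function names and the ℓ1 ground distance are assumptions for the example.

```python
from itertools import permutations


def l1(x, y):
    """Ground distance between two segment feature vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def emd_equal_weights(xs, ys, ground=l1):
    """EMD between two sets of n segments, each segment weighing 1/n.

    Exhaustive over all n! matchings, so only usable for very small n.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty sets of equal size")
    n = len(xs)
    best = min(
        sum(ground(xs[i], ys[p[i]]) for i in range(n))
        for p in permutations(range(n))
    )
    return best / n


print(emd_equal_weights([(0, 0), (1, 0)], [(0, 1), (1, 1)]))  # 1.0
```

Even in this simplified form the cost grows quickly with the number of segments, which illustrates why EMD is treated as the expensive step later in the talk.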
Segmentation & Feature Extraction (1) • Derive a small set of features that characterize the important attributes of a data object • Data-dependent
Segmentation & Feature Extraction (1) • Image data • JSEG image segmentation tool • Each segment is represented by a 14-dimensional feature vector • Color moments • First three moments in HSV color space → 9-D vector • Bounding box • Aspect ratio, bounding box size, area ratio, region centroid → 5-D vector • Segment weight: square root of segment size • ℓ1 distance between segments, EMD between images
Segmentation & Feature Extraction (2) • Audio data • Phonetic segmentation & feature extraction using MARSYAS • Each segment • 50 sliding windows × 6 MFCC parameters = 300 dimensions • Segment weight: segment length • Segment distance: ℓ1 distance • Sentence distance: EMD
Segmentation & Feature Extraction (3) • 3D shape data • 32 decomposing spheres • Spherical harmonic descriptor (SHD) • Spherical harmonic coefficients up to order 16 • 32 x 17 = 544 dimensions • ℓ2 distance
Sketch Construction • Sketches: tiny data structures that can be used to estimate properties of the original data • High-dimensional feature vector → N×K-bit vector • Hamming distance estimates the original feature vector distance • XOR groups of K bits → N-bit vector • Hamming distance estimates a thresholded distance [Figure: two vectors x = (x1,x2,x3,x4) and y = (y1,y2,y3,y4) mapped to bits 0/1]
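The two-stage construction can be sketched in Python as follows. The slide does not say how the N×K raw bits are produced, so this example assumes one known approach for ℓ1: each raw bit picks a random dimension (with probability proportional to its value range) and a random threshold in that range, and records whether the coordinate exceeds the threshold. The names `make_sketcher`, `lo`, `hi`, and `seed` are hypothetical, and per-dimension weights are left out for brevity.

```python
import random


def make_sketcher(lo, hi, n_bits, k, seed=0):
    """Build a function mapping a feature vector to an n_bits sketch.

    lo[i] and hi[i] bound the values of dimension i (assumed known).
    """
    rng = random.Random(seed)
    spans = [h - l for l, h in zip(lo, hi)]
    # One (dimension, threshold) pair per raw bit: N*K pairs in total.
    picks = []
    for _ in range(n_bits * k):
        i = rng.choices(range(len(lo)), weights=spans)[0]
        picks.append((i, rng.uniform(lo[i], hi[i])))

    def sketch(x):
        raw = [1 if x[i] > t else 0 for i, t in picks]  # N*K raw bits
        out = []
        for g in range(n_bits):
            bit = 0
            for b in raw[g * k:(g + 1) * k]:  # XOR each group of K bits
                bit ^= b
            out.append(bit)
        return out

    return sketch


def hamming(a, b):
    """Number of positions where two bit lists differ."""
    return sum(p != q for p, q in zip(a, b))


sk = make_sketcher([0, 0, 0], [1, 1, 1], n_bits=16, k=1, seed=1)
near = hamming(sk([0.1, 0.1, 0.1]), sk([0.4, 0.2, 0.3]))
far = hamming(sk([0.1, 0.1, 0.1]), sk([0.9, 0.8, 0.7]))
print(near, far)  # near is never larger than far for these inputs
```

With K = 1 the expected Hamming distance grows linearly with the ℓ1 distance. Larger K makes the sketch bits of distant objects look nearly random, so the Hamming distance saturates, which matches the "thresholded distance" behavior described on the slide.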
Filtering for Similarity Search • EMD computation is expensive • Filtering • Scans through the entire dataset • Uses a much faster distance function to filter out “bad” answers • Computes EMD for a much smaller candidate set • Criteria in picking candidate objects • Has at least one segment that is close enough to one of the top segments of the query object
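The filter-then-rank flow described on this slide can be sketched as below. Everything concrete here is an assumption for illustration: objects are lists of (weight, sketch) segments with sketches stored as integers, "top segments" means the heaviest `top_m` segments of the query, "close enough" means Hamming distance at most `max_bits`, and `expensive_dist` is a caller-supplied stand-in for EMD.

```python
def hamming(a, b):
    """Hamming distance between two sketches stored as integers."""
    return bin(a ^ b).count("1")


def filter_candidates(query, dataset, top_m=2, max_bits=2):
    """Scan the dataset; keep objects with a segment near a top query segment."""
    top = sorted(query, key=lambda seg: seg[0], reverse=True)[:top_m]
    candidates = []
    for obj_id, segments in dataset.items():
        if any(hamming(qs, s) <= max_bits
               for _, qs in top for _, s in segments):
            candidates.append(obj_id)
    return candidates


def search(query, dataset, expensive_dist, k, **filter_args):
    """Rank only the filtered candidates with the expensive distance."""
    candidates = filter_candidates(query, dataset, **filter_args)
    return sorted(candidates,
                  key=lambda i: expensive_dist(query, dataset[i]))[:k]


def toy_dist(q, o):
    """Placeholder for EMD: smallest sketch distance over segment pairs."""
    return min(hamming(a, b) for _, a in q for _, b in o)


query = [(0.7, 0b11110000), (0.3, 0b00001111)]
dataset = {
    "a": [(1.0, 0b11110001)],
    "b": [(1.0, 0b00000000)],
    "c": [(0.5, 0b10101010), (0.5, 0b00001111)],
}
print(filter_candidates(query, dataset, top_m=1))  # ['a']
print(search(query, dataset, toy_dist, k=1))       # ['c']
```

The filter still touches every object, but only with cheap bit operations; the expensive distance runs on the short candidate list.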
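Indexing for Similarity Search • A leveled tree where each level is a “cover” for the level beneath it (C_i denotes the set of nodes at level i) • Nesting: C_i ⊂ C_{i−1} • Covering tree: for every node p ∈ C_{i−1}, there exists a node q ∈ C_i satisfying d(p, q) ≤ 2^i, and exactly one such q is a parent of p • Separation: for all distinct nodes p, q ∈ C_i, d(p, q) > 2^i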
Performance Evaluation • Can we achieve high-quality similarity search results at high speed? • How small can the sketches be as the metadata of the similarity search engine? • What are the performance tradeoffs of • Brute-force • Filtering • Indexing
Benchmarks • Search quality benchmark suite • VARY image: 10k images, 32 sets • TIMIT audio: 6300 sentences, 450 sets • PSB shape: 1814 models, 92 sets • Search speed benchmark suite • Mixed image dataset: 600k images • Mixed audio dataset: 60k sentences • Mixed shape dataset: 40k shape models
Search Quality Metrics Given a query q with k similar objects: • First-tier • Percentage of similar objects returned within rank k • Second-tier • Percentage of similar objects returned within rank 2k • Average precision
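The three metrics can be computed from a ranked result list as in the Python sketch below. The function names are hypothetical. First-tier and second-tier follow the slide's definitions, with k taken as the number of similar objects; the slide does not define average precision, so the usual information-retrieval definition (mean of the precision at each rank where a similar object appears) is assumed.

```python
def first_tier(ranked, relevant):
    """Fraction of the k similar objects found within rank k."""
    k = len(relevant)
    return sum(r in relevant for r in ranked[:k]) / k


def second_tier(ranked, relevant):
    """Fraction of the k similar objects found within rank 2k."""
    k = len(relevant)
    return sum(r in relevant for r in ranked[:2 * k]) / k


def average_precision(ranked, relevant):
    """Mean of precision values at the ranks of the similar objects."""
    hits, total = 0, 0.0
    for rank, r in enumerate(ranked, start=1):
        if r in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


ranked = ["a", "x", "b", "y", "c", "z"]
relevant = {"a", "b", "c"}
print(first_tier(ranked, relevant))   # 2 of 3 similar objects in the top 3
print(second_tier(ranked, relevant))  # 1.0
print(average_precision(ranked, relevant))
```

In this toy ranking the similar objects sit at ranks 1, 3, and 5, so the average precision is (1/1 + 2/3 + 3/5) / 3.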
Conclusions & Future Work • A general-purpose content-based similarity search system • High-quality similarity search with reasonably high speed • Using sketches reduces metadata size • Filtering & indexing speed up similarity search • Future work • More efficient distance function than EMD • Further investigation of indexing data structures • More data types: video, genomic microarray data, other sensor data