
Content-Based Similarity Search

Moses Charikar, Princeton University. Joint work with: Qin Lv, William Josephson, Zhe Wang, Perry Cook, Matthew Hoffman, Kai Li.


Presentation Transcript


  1. Content-Based Similarity Search. Moses Charikar, Princeton University. Joint work with: Qin Lv, William Josephson, Zhe Wang, Perry Cook, Matthew Hoffman, Kai Li

  2. Motivation • Massive amounts of feature-rich digital data • Audio, video, digital photos, scientific sensor data • Noisy, high-dimensional • Traditional file systems/search tools inadequate • Exact match • Keyword-based search • Annotations • Need content-based similarity search

  3. Motivation • Recent progress in theoretical studies of sketches • Compact data representation for estimating pairwise similarity/distance • Compact data structures for high-quality and efficient content-based similarity search?

  4. Compact representation • Complex object → sketch (figure: example bit vectors 0101100110 and 0010110010) • Distance measured by (weighted) ℓ1 distance: d(x,y) = Σi wi·|xi-yi| • Better still, Hamming distance between bit vectors • Distance between sketches estimates distance between objects • Several theoretical constructions of sketches for sets, vectors, and earth mover distance (EMD)
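The two distance functions named on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the choice to store a sketch as a Python integer are assumptions, not something the slides specify.

```python
from typing import Sequence


def weighted_l1(x: Sequence[float], y: Sequence[float], w: Sequence[float]) -> float:
    """Weighted l1 distance: d(x, y) = sum_i w_i * |x_i - y_i|."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    return sum(wi * abs(xi - yi) for xi, yi, wi in zip(x, y, w))


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two sketches stored as integers."""
    return bin(a ^ b).count("1")


# Two example 10-bit sketches.
sketch_a = 0b0101100110
sketch_b = 0b0010110010
bits_apart = hamming(sketch_a, sketch_b)

# Weighted l1 on two small feature vectors (hypothetical values).
dist = weighted_l1([1.0, 2.0, 3.0], [2.0, 2.0, 1.0], [1.0, 1.0, 0.5])
```

Storing sketches as machine words is what makes the Hamming comparison cheap: one XOR and one bit count per pair.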

  5. Outline • Motivation • System architecture • Implementation details • Segmentation & feature extraction • Sketch construction • Filtering • Indexing • Performance evaluation • Conclusions & future work

  6. System Architecture

  7. Similarity Search Engine Architecture (diagram with two phases: pre-processing and query time)

  8. Similarity Search Problem • Similarity search: finding objects similar to a query object, i.e., containing similar features • Object representation • Distance function d(X, Y) • Nearest neighbor query • K-nearest neighbor (KNN) • Approximate nearest neighbor (ANN)
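As a baseline for the KNN query, a brute-force scan can be sketched as follows. The names `knn` and `l1` are illustrative assumptions; any distance function d(X, Y) could be passed in.

```python
import heapq
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def l1(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain l1 distance, used here only as an example distance function."""
    return sum(abs(p - q) for p, q in zip(a, b))


def knn(query: T, objects: Sequence[T],
        dist: Callable[[T, T], float], k: int) -> List[Tuple[float, int]]:
    """Return the k closest objects as (distance, index) pairs, nearest first.

    Every object is compared with the query, so the cost is one distance
    computation per object in the dataset.
    """
    scored = ((dist(query, obj), idx) for idx, obj in enumerate(objects))
    return heapq.nsmallest(k, scored)


points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.0)]
nearest = knn((0.0, 0.0), points, l1, k=2)
```

An approximate nearest neighbor (ANN) method would trade some of this exactness for speed; the scan above always returns the exact answer.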

  9. Object Representation & Distance Function • Earth Mover Distance (EMD) (figure with numeric weights omitted)

  10. Segmentation & Feature Extraction (1) • Derive a small set of features that characterize the important attributes of a data object • Data-dependent

  11. Segmentation & Feature Extraction (1) • Image data • JSEG image segmentation tool • Each segment is represented by a 14-dimensional feature vector • Color moments: first three moments in HSV color space → 9-D vector • Bounding box: aspect ratio, bounding box size, area ratio, region centroid → 5-D vector • Segment weight ∝ square root of segment size • ℓ1 distance between segments, EMD between images
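The 9-D color part of the segment feature could be computed roughly as below. This is a sketch under assumptions: the slide does not say which three moments are used, so the common choice of mean, standard deviation, and cube root of the third central moment is assumed here, hue is treated as a linear value for simplicity, and the function name is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Pixel = Tuple[float, float, float]  # (H, S, V), each assumed scaled to [0, 1]


def color_moments(pixels: Sequence[Pixel]) -> List[float]:
    """Return 3 moments for each of the 3 HSV channels: a 9-D vector."""
    if not pixels:
        raise ValueError("segment has no pixels")
    n = len(pixels)
    features: List[float] = []
    for channel in range(3):
        values = [p[channel] for p in pixels]
        mean = sum(values) / n
        second = sum((v - mean) ** 2 for v in values) / n
        third = sum((v - mean) ** 3 for v in values) / n
        std = math.sqrt(second)
        # Signed cube root keeps the third moment on the same scale as the data.
        skew = math.copysign(abs(third) ** (1.0 / 3.0), third)
        features.extend([mean, std, skew])
    return features


feats = color_moments([(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)])
```

Appending the five bounding-box values to this vector would give the 14 dimensions mentioned on the slide.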

  12. Segmentation & Feature Extraction (2) • Audio data • Phonetic segmentation & feature extraction using MARSYAS • Each segment: 50 sliding windows × 6 MFCC parameters = 300 dimensions • Segment weight ∝ segment length • Segment distance: ℓ1 distance • Sentence distance: EMD

  13. Segmentation & Feature Extraction (3) • 3D shape data • 32 decomposing spheres • Spherical harmonic descriptor (SHD) • Spherical harmonic coefficients up to order 16 • 32 × 17 = 544 dimensions • ℓ2 distance

  14. Sketch Construction (figure: x = (x1,x2,x3,x4), y = (y1,y2,y3,y4), axis from 0 to 1) • Sketches: tiny data structures that can be used to estimate properties of the original data • High-dimensional feature vector → N·K-bit vector • Hamming distance estimates original feature vector distance • XOR groups of K bits → N-bit vector • Hamming distance estimates thresholded distance
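One way to realize the two steps on this slide is sketched below. The slide does not say how each raw bit is generated, so this example assumes a common scheme: every raw bit picks a random dimension and a random threshold in [lo, hi] and records whether the coordinate is at or above the threshold. All names and parameters here are illustrative.

```python
import random
from typing import Callable, List, Sequence


def make_sketcher(dim: int, n: int, k: int, lo: float = 0.0,
                  hi: float = 1.0, seed: int = 0) -> Callable[[Sequence[float]], List[int]]:
    """Build a function mapping a dim-dimensional vector to an n-bit sketch."""
    rng = random.Random(seed)
    # One (dimension, threshold) probe per raw bit: n * k probes in total.
    probes = [(rng.randrange(dim), rng.uniform(lo, hi)) for _ in range(n * k)]

    def sketch(x: Sequence[float]) -> List[int]:
        if len(x) != dim:
            raise ValueError("vector has the wrong dimension")
        # Step 1: feature vector -> N*K raw bits.
        raw = [1 if x[i] >= t else 0 for i, t in probes]
        # Step 2: XOR each group of K raw bits down to a single bit.
        out = []
        for g in range(n):
            bit = 0
            for b in raw[g * k:(g + 1) * k]:
                bit ^= b
            out.append(bit)
        return out

    return sketch


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions where two bit lists differ."""
    return sum(p != q for p, q in zip(a, b))


sketch = make_sketcher(dim=4, n=64, k=2)
x = [0.1, 0.5, 0.9, 0.3]
y = [0.1, 0.5, 0.8, 0.3]
d_xy = hamming(sketch(x), sketch(y))
```

Under this assumed scheme, two raw bits from the same probe differ with probability proportional to the coordinate gap, which is why the raw Hamming distance tracks the ℓ1 distance. XORing K bits makes the sketch distance rise faster for close pairs and flatten for distant ones, which is one plausible reading of "thresholded distance".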

  15. Filtering for Similarity Search • EMD computation is expensive • Filtering • Scans through the entire dataset • Uses a much faster distance function to filter out “bad” answers • Computes EMD only for a much smaller candidate set • Criterion for picking candidate objects • Has at least one segment that is close enough to one of the top segments of the query object
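The filter-then-rank flow described here can be sketched as follows. Several details are assumptions made for illustration: "top segments" is taken to mean the highest-weight segments of the query, "close enough" is a fixed ℓ1 radius, and `min_segment_distance` is a cheap stand-in for the expensive object distance (it is not EMD).

```python
from typing import Callable, List, Sequence, Tuple

Segment = Tuple[float, Sequence[float]]  # (weight, feature vector)
Obj = Sequence[Segment]


def l1(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(abs(p - q) for p, q in zip(a, b))


def filter_candidates(query: Obj, dataset: Sequence[Obj],
                      top_m: int, radius: float) -> List[int]:
    """Keep objects with at least one segment near a top query segment."""
    top = sorted(query, key=lambda s: s[0], reverse=True)[:top_m]
    keep = []
    for idx, obj in enumerate(dataset):
        if any(l1(qv, ov) <= radius for _, qv in top for _, ov in obj):
            keep.append(idx)
    return keep


def min_segment_distance(a: Obj, b: Obj) -> float:
    """Stand-in for the expensive distance; a real system would use EMD."""
    return min(l1(u, v) for _, u in a for _, v in b)


def search(query: Obj, dataset: Sequence[Obj],
           expensive_dist: Callable[[Obj, Obj], float],
           k: int, top_m: int = 2, radius: float = 0.5) -> List[int]:
    """Filter the whole dataset cheaply, then rank only the candidates."""
    candidates = filter_candidates(query, dataset, top_m, radius)
    ranked = sorted(candidates, key=lambda i: expensive_dist(query, dataset[i]))
    return ranked[:k]


query = [(0.7, [0.0, 0.0]), (0.3, [1.0, 1.0])]
dataset = [
    [(1.0, [0.1, 0.0])],
    [(1.0, [5.0, 5.0])],
    [(0.5, [0.8, 1.0]), (0.5, [4.0, 4.0])],
]
results = search(query, dataset, min_segment_distance, k=2)
```

The expensive distance is evaluated only on objects that survive the filter, so the second object in this toy dataset is never scored.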

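  16. Indexing for Similarity Search • A leveled tree where each level is a “cover” for the level beneath it (C_i denotes the set of nodes at level i) • Nesting: C_i ⊂ C_{i-1} • Covering tree: for every node p ∈ C_{i-1}, there exists a node q ∈ C_i satisfying d(p, q) ≤ 2^i, and exactly one such q is a parent of p • Separation: for all distinct nodes p, q ∈ C_i, d(p, q) > 2^i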

  17. Performance Evaluation • Can we achieve high-quality similarity search results at high speed? • How small can the sketches be as the metadata of the similarity search engine? • What are the performance tradeoffs of • Brute-force • Filtering • Indexing

  18. Benchmarks • Search quality benchmark suite • VARY image: 10k images, 32 sets • TIMIT audio: 6300 sentences, 450 sets • PSB shape: 1814 models, 92 sets • Search speed benchmark suite • Mixed image dataset: 600k images • Mixed audio dataset: 60k sentences • Mixed shape dataset: 40k shape models

  19. Search Quality Metrics Given a query q with k similar objects: • First-tier • Percentage of similar objects returned within rank k • Second-tier • Percentage of similar objects returned within rank 2k • Average precision
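The three metrics can be computed from a ranked result list as in the sketch below. First-tier and second-tier follow the definitions on the slide; the slide does not define average precision, so the usual information-retrieval definition is assumed (mean of precision at each relevant hit, with unretrieved relevant objects counting as zero). Function names are illustrative, and the ranking is assumed to contain no duplicates.

```python
from typing import Hashable, Sequence, Set


def tier(ranking: Sequence[Hashable], relevant: Set[Hashable], multiple: int) -> float:
    """Fraction of the k relevant objects found within rank multiple * k."""
    k = len(relevant)
    if k == 0:
        raise ValueError("query has no similar objects")
    top = ranking[: multiple * k]
    return sum(1 for item in top if item in relevant) / k


def first_tier(ranking: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    return tier(ranking, relevant, 1)


def second_tier(ranking: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    return tier(ranking, relevant, 2)


def average_precision(ranking: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Mean over relevant objects of the precision at the rank where each appears."""
    if not relevant:
        raise ValueError("query has no similar objects")
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


ranking = ["a", "x", "b", "y", "c", "z"]
relevant = {"a", "b", "c"}
ft = first_tier(ranking, relevant)
st = second_tier(ranking, relevant)
ap = average_precision(ranking, relevant)
```

In this toy ranking, two of the three similar objects appear in the top 3 and all three appear in the top 6.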

  20. Search Quality & Search Speed

  21. Search Quality vs. Sketch Size

  22. Brute Force, Filtering, Indexing

  23. Conclusions & Future Work • A general-purpose content-based similarity search system • High-quality similarity search at reasonably high speed • Using sketches reduces metadata size • Filtering & indexing speed up similarity search • Future work • More efficient distance function than EMD • Further investigation of indexing data structures • More data types: video, genomic microarray data, other sensor data
