Content-Based Similarity Search

Content-Based Similarity Search • Moses Charikar, Princeton University • Joint work with: Qin Lv, William Josephson, Zhe Wang, Perry Cook, Matthew Hoffman, Kai Li

Motivation • Massive amounts of feature-rich digital data • Audio, video, digital photos, scientific sensor data • Noisy, high-dimensional • Traditional file systems/search tools inadequate • Exact match • Keyword-based search • Annotations • Need content-based similarity search

Motivation • Recent progress in theoretical studies of sketches • Compact data representation for estimating pairwise similarity/distance • Compact data structures for high-quality and efficient content-based similarity search?

Compact representation • (Figure: a complex object is mapped to a sketch, shown as bit vectors such as 0 1 0 1 1 0 0 1 1 0 and 0 0 1 0 1 1 0 0 1 0) • Distance measured by (weighted) ℓ1 distance: d(x,y) = Σi wi·|xi-yi| • Better still, Hamming distance between bit vectors • Distance between sketches estimates distance between objects • Several theoretical constructions of sketches for sets, vectors, earth mover distance (EMD)
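
The two distances named on this slide can be sketched in a few lines. This is a minimal illustration, not the system's code: the function names are hypothetical, and packing a bit-vector sketch into a Python integer is an assumption made for brevity.

```python
def weighted_l1(x, y, w):
    """Weighted l1 distance: sum over i of w_i * |x_i - y_i|."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    return sum(wi * abs(xi - yi) for xi, yi, wi in zip(x, y, w))


def hamming(a, b):
    """Hamming distance between two bit vectors packed into integers."""
    # XOR leaves a 1 exactly where the two sketches disagree.
    return bin(a ^ b).count("1")


# Two example 10-bit sketches (illustrative values).
sketch_x = 0b0101100110
sketch_y = 0b0010110010
print(hamming(sketch_x, sketch_y))  # prints 5
```

Comparing sketches needs only an XOR and a bit count, which is why Hamming distance on bit vectors is attractive compared with arithmetic on the original features.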

Outline • Motivation • System architecture • Implementation details • Segmentation & feature extraction • Sketch construction • Filtering • Indexing • Performance evaluation • Conclusions & future work

System Architecture

Similarity Search Engine Architecture • (Figure: architecture diagram with two phases, pre-processing and query time)

Similarity Search Problem • Similarity search: finding objects similar to a query object, i.e., containing similar features • Object representation • Distance function d(X, Y) • Nearest neighbor query • K-nearest neighbor (KNN) • Approximate nearest neighbor (ANN)
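
A K-nearest-neighbor query over an arbitrary distance function can be sketched as a linear scan. This is only a baseline illustration under assumed names (`knn`, `l1`); it says nothing about how the system itself answers queries.

```python
import heapq


def knn(query, objects, dist, k):
    """Return the k nearest objects as (distance, index) pairs, nearest first."""
    scored = ((dist(query, obj), i) for i, obj in enumerate(objects))
    # nsmallest keeps only k entries in memory while scanning the dataset.
    return heapq.nsmallest(k, scored)


def l1(x, y):
    """Plain l1 distance, used here as a stand-in for d(X, Y)."""
    return sum(abs(a - b) for a, b in zip(x, y))


points = [(0, 0), (1, 1), (5, 5), (2, 2)]
nearest = knn((0, 0), points, l1, k=2)
```

An approximate nearest neighbor method relaxes this by allowing results that are only close to the true nearest ones, in exchange for not scanning everything.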

Object Representation & Distance Function • Earth Mover Distance (EMD) • (Figure: example objects annotated with numeric weights such as 0.2, 0.3, 0.1, 0.4)
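
For reference, the usual transportation-problem form of EMD can be written as below. The notation is an assumption, since the transcript gives no formula: X = {(x_i, w_i)} and Y = {(y_j, u_j)} are sets of weighted segments, each set's weights are taken to sum to 1, d is the ground distance between segments, and f_ij is the amount of weight moved from x_i to y_j.

```latex
\mathrm{EMD}(X, Y) \;=\; \min_{f_{ij} \ge 0} \; \sum_{i} \sum_{j} f_{ij}\, d(x_i, y_j)
\quad \text{subject to} \quad
\sum_{j} f_{ij} = w_i \;\; \forall i, \qquad
\sum_{i} f_{ij} = u_j \;\; \forall j .
```

Evaluating this requires solving an optimization problem for every pair of objects, which is consistent with the later remark that EMD computation is expensive.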

Segmentation & Feature Extraction (1) • Derive a small set of features that characterize the important attributes of a data object • Data-dependent

Segmentation & Feature Extraction (1) • Image Data • JSEG image segmentation tool • Each segment is represented by a 14-dimensional feature vector • Color moments • First three moments in HSV color space → 9-D vector • Bounding box • Aspect ratio, bounding box size, area ratio, region centroid → 5-D vector • Segment weight → square root of segment size • ℓ1 distance between segments, EMD between images
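
The segment weighting on this slide can be illustrated as follows. Taking the square root of segment size comes from the slide; normalizing the weights so they sum to 1 is an assumption added here so that weights are comparable across images, and `segment_weights` is a hypothetical name.

```python
import math


def segment_weights(sizes):
    """Weights proportional to the square root of each segment's size.

    Normalization to a total of 1 is an assumption, not stated on the slide.
    """
    raw = [math.sqrt(s) for s in sizes]
    total = sum(raw)
    return [r / total for r in raw]


# Three segments covering 4, 16 and 36 pixels.
weights = segment_weights([4, 16, 36])
```

The square root dampens the influence of very large segments: a segment nine times larger gets only three times the weight.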

Segmentation & Feature Extraction (2) • Audio Data • Phonetic segmentation & feature extraction using MARSYAS • Each segment • 50 sliding windows × 6 MFCC parameters = 300 dimensions • Segment weight → segment length • Segment distance: ℓ1 distance • Sentence distance: EMD

Segmentation & Feature Extraction (3) • 3D shape data • 32 decomposing spheres • Spherical harmonic descriptor (SHD) • Spherical harmonic coefficients up to order 16 • 32 x 17 = 544 dimensions • ℓ2 distance

Sketch Construction • (Figure: two feature vectors, x = (x1,x2,x3,x4) and y = (y1,y2,y3,y4), mapped to bits 0 and 1) • Sketches: tiny data structures that can be used to estimate properties of original data • High-dimensional feature vector → N×K-bit vector • Hamming distance ≈ original feature vector distance • XOR groups of K bits → N-bit vector • Hamming distance ≈ thresholded distance
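
One way to realize the two steps on this slide is sketched below. It is an assumption, not confirmed by the transcript: each of the N×K raw bits compares a randomly chosen coordinate against a random threshold, so that two vectors disagree on a bit with probability proportional to their (weighted) ℓ1 distance; groups of K bits are then XORed into N output bits. The names `make_sketcher`, `lows`, `highs` and `seed` are hypothetical.

```python
import random


def make_sketcher(lows, highs, n_bits, k, weights=None, seed=0):
    """Return a function mapping a feature vector to a list of n_bits bits."""
    rng = random.Random(seed)
    dims = list(range(len(lows)))
    if weights is None:
        weights = [1.0] * len(lows)
    # Pick coordinate i with probability proportional to w_i * (high_i - low_i).
    probs = [w * (h - l) for w, l, h in zip(weights, lows, highs)]
    picks = rng.choices(dims, weights=probs, k=n_bits * k)
    # Each pick gets its own threshold, uniform over that coordinate's range.
    tests = [(i, rng.uniform(lows[i], highs[i])) for i in picks]

    def sketch(x):
        raw = [1 if x[i] >= t else 0 for i, t in tests]  # N*K raw bits
        out = []
        for g in range(n_bits):
            bit = 0
            for b in raw[g * k:(g + 1) * k]:  # XOR each group of K bits
                bit ^= b
            out.append(bit)
        return out

    return sketch


sketch = make_sketcher(lows=[0.0, 0.0], highs=[1.0, 1.0], n_bits=16, k=2)
s = sketch([0.2, 0.7])
```

With K = 1 the Hamming distance between sketches tracks the ℓ1 distance roughly linearly. With larger K, a single disagreement inside a group already flips the output bit, so the Hamming distance grows quickly for nearby vectors and saturates for distant ones, which matches the "thresholded distance" on the slide.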

Filtering for Similarity Search • EMD computation is expensive • Filtering • Scans through the entire dataset • Uses a much faster distance function to filter out “bad” answers • Computes EMD only for the much smaller candidate set • Criterion for picking candidate objects • Has at least one segment that is close enough to one of the top segments of the query object
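
The filter-then-rank idea can be sketched as follows. Several details are assumptions made for illustration: objects are lists of (segment sketch, weight) pairs, "top segments" means the highest-weight query segments, "close enough" is a Hamming radius, and `expensive_dist` stands in for EMD. All function and parameter names are hypothetical.

```python
def hamming(a, b):
    """Hamming distance between two integer-packed segment sketches."""
    return bin(a ^ b).count("1")


def candidates(query, dataset, top_q, radius):
    """Indices of objects having a segment within radius of a top query segment."""
    top = sorted(query, key=lambda seg: seg[1], reverse=True)[:top_q]
    keep = []
    for idx, obj in enumerate(dataset):
        if any(hamming(qs, s) <= radius for qs, _ in top for s, _ in obj):
            keep.append(idx)
    return keep


def search(query, dataset, expensive_dist, top_q, radius, n_results):
    """Cheap scan over everything, expensive distance on the survivors only."""
    cand = candidates(query, dataset, top_q, radius)
    ranked = sorted(cand, key=lambda i: expensive_dist(query, dataset[i]))
    return ranked[:n_results]


query = [(0b1111, 0.7), (0b0000, 0.3)]
dataset = [
    [(0b1110, 1.0)],
    [(0b0000, 1.0)],
    [(0b1111, 0.5), (0b0001, 0.5)],
]
kept = candidates(query, dataset, top_q=1, radius=1)
```

The scan still touches every object, so the saving comes entirely from replacing most expensive distance computations with cheap sketch comparisons.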

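Indexing for Similarity Search • A leveled tree where each level is a “cover” for the level beneath it • Notation (standard for cover trees): C_i is the set of nodes at level i • Nesting: C_i ⊂ C_{i−1} • Covering tree: For every node p ∈ C_{i−1}, there exists a node q ∈ C_i satisfying d(p, q) ≤ 2^i, and exactly one such q is a parent of p • Separation: For all distinct nodes p, q ∈ C_i, d(p, q) > 2^i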
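
The three invariants can be checked mechanically on a small example. The sketch below assumes the usual cover tree conventions, which the transcript does not spell out: C_i is the set of nodes at level i, level i uses scale 2^i, nesting means C_i ⊂ C_{i−1}, covering means every node one level down lies within 2^i of some node at level i, and separation means distinct nodes at level i are more than 2^i apart. It verifies only that a covering parent exists, not the parent assignment, and `check_cover_levels` is a hypothetical name.

```python
def check_cover_levels(levels, dist):
    """levels maps a level index i to the non-empty list of points at that level."""
    for i in sorted(levels):
        c_i = levels[i]
        scale = 2 ** i
        # Separation: distinct points at level i are more than 2^i apart.
        for a in range(len(c_i)):
            for b in range(a + 1, len(c_i)):
                if dist(c_i[a], c_i[b]) <= scale:
                    return False
        if i - 1 in levels:
            below = levels[i - 1]
            # Nesting: every point at level i also appears one level down.
            if any(p not in below for p in c_i):
                return False
            # Covering: every point one level down is within 2^i of level i.
            if any(min(dist(p, q) for q in c_i) > scale for p in below):
                return False
    return True


line_dist = lambda a, b: abs(a - b)
good = {2: [0], 1: [0, 3], 0: [0, 3, 1.5]}
ok = check_cover_levels(good, line_dist)
```

Because each level both covers and thins out the one below it, a search can descend from the top while discarding whole subtrees that are too far from the query.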

Performance Evaluation • Can we achieve high-quality similarity search results at high speed? • How small can the sketches be as the metadata of the similarity search engine? • What are the performance tradeoffs of • Brute-force • Filtering • Indexing

Benchmarks • Search quality benchmark suite • VARY image: 10k images, 32 sets • TIMIT audio: 6,300 sentences, 450 sets • PSB shape: 1,814 models, 92 sets • Search speed benchmark suite • Mixed image dataset: 600k images • Mixed audio dataset: 60k sentences • Mixed shape dataset: 40k shape models

Search Quality Metrics Given a query q with k similar objects: • First-tier • Percentage of similar objects returned within rank k • Second-tier • Percentage of similar objects returned within rank 2k • Average precision
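
These metrics can be computed from a ranked result list and the set of objects known to be similar to the query. The sketch below follows the slide's definitions of first-tier and second-tier; the average-precision formula (mean of the precision at each rank where a similar object appears) is the common definition and is an assumption here, as are the function names.

```python
def tier(ranked, relevant, multiple=1):
    """Fraction of the k similar objects found within rank multiple * k."""
    k = len(relevant)
    top = ranked[: multiple * k]
    return sum(1 for r in top if r in relevant) / k


def average_precision(ranked, relevant):
    """Mean of precision values at the ranks where similar objects appear."""
    hits = 0
    total = 0.0
    for rank, r in enumerate(ranked, start=1):
        if r in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


ranked = ["a", "x", "b", "y", "c"]
relevant = {"a", "b", "c"}
first_tier = tier(ranked, relevant)               # 2 of 3 found in the top 3
second_tier = tier(ranked, relevant, multiple=2)  # all 3 found in the top 6
```

First-tier rewards only near-perfect rankings, second-tier is more forgiving, and average precision reflects the position of every similar object in the list.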

Search Quality & Search Speed

Search Quality vs. Sketch Size

Brute Force, Filtering, Indexing

Conclusions & Future Work • A general-purpose content-based similarity search system • High-quality similarity search at reasonably high speed • Using sketches reduces metadata size • Filtering & indexing speed up similarity search • Future work • More efficient distance functions than EMD • Further investigation of indexing data structures • More data types: video, genomic microarray data, other sensor data
