CIS750 – Seminar in Advanced Topics in Computer Science: Advanced topics in databases – Multimedia Databases. V. Megalooikonomou. Generic Multimedia Indexing (some slides are based on notes by C. Faloutsos)
General Overview • Multimedia Indexing • Spatial Access Methods (SAMs) • k-d trees • Point Quadtrees • MX-Quadtree • z-ordering • R-trees • Generic Multimedia Indexing
Multimedia Indexing – Detailed outline • Generic Multimedia Indexing • Problem definition • Distance function • Similarity queries – Types • Requirements (ideal method) • Basic idea, Lower-bounding • GEMINI approach • Applications • 1-D Time sequences • 2-D Color images
Generic Multimedia Indexing - problem • Given a database of multimedia objects • Design fast search algorithms that locate objects that match a query object, exactly or approximately • Objects: • 1-d time sequences • Digitized voice or music • 2-d color images • 2-d or 3-d gray scale medical images • Video clips • E.g.: “Find companies whose stock prices move similarly”
Generic Multimedia Indexing – problem • 1st step: provide a measure for the distance between two objects • Distance function D(): • Given two objects $O_A$, $O_B$, the distance (= dissimilarity) of the two objects is denoted by $D(O_A, O_B)$ • E.g., the Euclidean distance (square root of the sum of squared differences) of two equal-length time series
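As a minimal sketch of such a distance function, the Euclidean distance of two equal-length time series could be written as follows. The function name `euclidean_distance` is an illustrative choice, not something defined in the slides.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """D(a, b): square root of the sum of squared differences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

A distance of 0 means the two series are identical; larger values mean less similar series.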
Types of Similarity Queries • Similarity queries are classified into: • Whole match queries: • Given a collection of N objects $O_1, \ldots, O_N$ and a query object Q, find the data objects that are within distance $\epsilon$ from Q • Sub-pattern match: • Given a collection of N objects $O_1, \ldots, O_N$, a query (sub-)object Q, and a tolerance $\epsilon$, identify the parts of the data objects that match the query Q • [Figure: time sequences S1, …, Sn over days 1–365, each mapped to a point F(S1), …, F(Sn) in a 2-d feature space with axes avg and std]
Types of Similarity Queries • Additional types of queries: • K-nearest neighbor queries: • Given a collection of N objects $O_1, \ldots, O_N$ and a query object Q, find the K data objects most similar to Q • All-pairs queries (or ‘spatial joins’): • Given a collection of N objects $O_1, \ldots, O_N$, find all pairs of objects that are within distance $\epsilon$ from each other
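The four query types can be made concrete with a brute-force baseline. The sketch below assumes objects are numeric sequences compared with the Euclidean distance; all function names are hypothetical, and nothing here is indexed, so every query scans the whole collection.

```python
import heapq
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Series = Sequence[float]


def dist(a: Series, b: Series) -> float:
    """Euclidean distance between two equal-length sequences."""
    return math.dist(a, b)


def whole_match(db: Sequence[Series], q: Series, eps: float) -> List[int]:
    """Indices of objects within distance eps of the query q."""
    return [i for i, o in enumerate(db) if dist(o, q) <= eps]


def subpattern_match(db: Sequence[Series], q: Series, eps: float) -> List[Tuple[int, int]]:
    """(object index, offset) of every window of len(q) within eps of q."""
    m = len(q)
    return [(i, s)
            for i, o in enumerate(db)
            for s in range(len(o) - m + 1)
            if dist(o[s:s + m], q) <= eps]


def k_nearest(db: Sequence[Series], q: Series, k: int) -> List[int]:
    """Indices of the k objects closest to q, nearest first."""
    return heapq.nsmallest(k, range(len(db)), key=lambda i: dist(db[i], q))


def all_pairs(db: Sequence[Series], eps: float) -> List[Tuple[int, int]]:
    """All index pairs (i, j), i < j, whose objects are within eps."""
    return [(i, j) for i, j in combinations(range(len(db)), 2)
            if dist(db[i], db[j]) <= eps]
```

These scans cost one distance computation per object (or per pair, or per window), which is the cost an index is meant to avoid.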
Ideal method – requirements • Fast: sequential scanning with a distance calculation against each and every object is too slow for large databases • “Correct”: no false dismissals. False alarms are acceptable. Why? (False alarms can be discarded in a post-processing step; a falsely dismissed object is never seen again) • Small space overhead • Dynamic: easy to insert, delete, and update objects
Approach Outline • Use k feature extraction functions to map objects into points in k-dimensional space (applying a mapping F()) • Use highly fine-tuned database SAMs (Spatial Access Methods) like R-trees to accelerate the search (by pruning out large portions of the database that are not promising)…
Basic idea • Focus on ‘whole match’ queries • Given a collection of N objects $O_1, \ldots, O_N$, a distance/dissimilarity function $D(O_i, O_j)$, and a query object Q, find the data objects that are within distance $\epsilon$ from Q • Sequential scanning? May be too slow, for the following reasons: • Distance computation is expensive (e.g., edit distance on DNA strings) • The database size N may be huge • Faster alternative?
Basic idea • Faster alternative: • Step 1: a ‘quick and dirty’ test to quickly discard the vast majority of non-qualifying objects • Step 2: use of SAMs to achieve faster-than-sequential searching • Example: • Database of yearly stock price movements • Euclidean distance function • Characterize each sequence with a single number (‘feature’) • Or use two or more features
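Basic idea – illustration • [Figure: time sequences S1, …, Sn over days 1–365, each mapped to a point F(S1), …, F(Sn) in a 2-d feature space with axes Feature1 and Feature2] • A query with tolerance $\epsilon$ becomes a sphere with radius $\epsilon$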
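A small sketch of this mapping, assuming the two features are the average and the (population) standard deviation of a yearly sequence, as on the axes of the earlier figure. The names `feature_map` and `inside_query_sphere` are illustrative only.

```python
import math
import statistics
from typing import Sequence, Tuple


def feature_map(series: Sequence[float]) -> Tuple[float, float]:
    """F(): map a sequence to the 2-d point (average, standard deviation)."""
    return (statistics.fmean(series), statistics.pstdev(series))


def inside_query_sphere(fq: Sequence[float], fo: Sequence[float], eps: float) -> bool:
    """True if feature point fo lies in the sphere of radius eps around fq."""
    return math.dist(fq, fo) <= eps
```

In 2-d feature space the "sphere" is a disk; a spatial access method would retrieve the points inside it without visiting every point.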
Basic idea – caution! • The mapping F() from objects to k-d points should not distort the distances • D(): distance of two objects • $D_f$(): distance of the corresponding feature vectors • Ideally, perfect preservation of distances • In practice, a guarantee of no false dismissals • How? If the distance in f-space matches or underestimates the distance between the two objects in the original space
Basic idea – Lower bounding • Let $O_1$, $O_2$ be two objects with distance function D(), and let $F(O_1)$, $F(O_2)$ be their feature vectors with distance function $D_f$() • To guarantee no false dismissals for whole match queries, the feature extraction function F() should satisfy: $D_f(F(O_1), F(O_2)) \le D(O_1, O_2)$ for every pair of objects $O_1$, $O_2$
Lower bounding – Proof • Let Q be the query object, O a qualifying object, and $\epsilon$ the tolerance • Prove: if object O qualifies, it will be retrieved by a range query in the f-space • Or, $D(Q, O) \le \epsilon \Rightarrow D_f(F(Q), F(O)) \le \epsilon$ • However, $D_f(F(Q), F(O)) \le D(Q, O) \le \epsilon$ • What about ‘all-pairs’? (‘spatial join’ on f-space) • What about ‘nearest-neighbor’ queries?
GEneric Multimedia object INdexIng • GEMINI approach: • Determine distance function D() • Find one or more numerical feature-extraction functions (to provide a ‘quick and dirty’ test) • Prove that Df() lower-bounds D() to guarantee no false dismissals • Use a SAM (e.g., R-tree) to store and retrieve k-d feature vectors • !!! The methodology focuses on the speed of search only; not on the quality of the results which relies on the distance function
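The steps above can be sketched as a filter-and-refine range query. This is only an illustration under stated assumptions: objects are equal-length numeric sequences, D() is the Euclidean distance, and F() is a hypothetical feature function that keeps k segment averages, each scaled by the square root of the segment length so that the feature distance lower-bounds D(). A plain linear scan over the feature vectors stands in for the SAM, since the standard library has no R-tree.

```python
import math
from typing import List, Sequence

Series = Sequence[float]


def true_distance(a: Series, b: Series) -> float:
    """D(): Euclidean distance in the original space."""
    return math.dist(a, b)


def features(s: Series, k: int = 2) -> List[float]:
    """F(): k segment averages, each scaled by sqrt(segment length).

    Assumes len(s) is a multiple of k. With this scaling the Euclidean
    distance between feature vectors never exceeds true_distance().
    """
    m = len(s) // k
    return [math.sqrt(m) * sum(s[j * m:(j + 1) * m]) / m for j in range(k)]


def gemini_range_query(db: Sequence[Series], q: Series, eps: float, k: int = 2) -> List[int]:
    """Indices of objects within eps of q, via filter then refine."""
    fq = features(q, k)
    # Filter: cheap test in feature space. A real system would precompute
    # the feature vectors and query a SAM instead of scanning them.
    candidates = [i for i, o in enumerate(db)
                  if math.dist(features(o, k), fq) <= eps]
    # Refine: discard false alarms with the actual distance function.
    return [i for i in candidates if true_distance(db[i], q) <= eps]
```

Because the feature distance is a lower bound, the filter step can only let extra objects through; it cannot drop a qualifying one, so the refine step returns the same answer as a sequential scan.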
Generic Multimedia Object Indexing • Applications: • 1-d time sequences • 2-d color images • Problems to solve: • How to apply the lower-bounding lemma • ‘Curse of Dimensionality’ (time sequences) • ‘Cross-talk’ of features (color images)
1-D Time Sequences • Distance function: Euclidean distance • Find features that: • Preserve/lower-bound the distance • Carry as much information as possible (to reduce false alarms) • If we are allowed to use only one feature, what would it be? The average. • … extending it… • The average of the 1st half, of the 2nd half, of the 1st quarter, etc. • Coefficients of the Fourier transform (DFT), wavelet transform, etc.
1-D Time Sequences • Show that the distance in feature space lower-bounds the actual distance • What about DFT? Parseval’s Theorem: the (orthonormal) DFT preserves the energy of a signal as well as the distance between two signals: D(x, y) = D(X, Y), where X and Y are the Fourier transforms of x and y • If we keep the first $k \le n$ coefficients of the DFT, we lower-bound the actual distance
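A minimal sketch of this property, using a naive $O(n^2)$ DFT with the $1/\sqrt{n}$ normalization under which Parseval's theorem holds as an equality of distances. The function names are illustrative; a real system would use an FFT library.

```python
import cmath
import math
from typing import List, Sequence


def dft(x: Sequence[float]) -> List[complex]:
    """Orthonormal DFT: X_f = (1/sqrt(n)) * sum_t x_t * exp(-2*pi*i*f*t/n)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            / math.sqrt(n)
            for f in range(n)]


def truncated_distance(X: Sequence[complex], Y: Sequence[complex], k: int) -> float:
    """Euclidean distance using only the first k DFT coefficients."""
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(X[:k], Y[:k])))
```

With all n coefficients, `truncated_distance` equals the time-domain Euclidean distance up to rounding; dropping coefficients only removes non-negative terms from the sum, so the result can only shrink.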
1-D Time Sequences • Response time improves the more the transform concentrates the energy of the signal in a few coefficients • DFT concentrates the energy for a large class of signals, the colored noises • Colored noises: skewed energy spectrum that drops as $O(f^{-b})$ • The energy spectrum (or power spectrum) of a signal is the square of the amplitude $|X_f|$ as a function of the frequency f • b = 2: random walks or brown noise (very predictable) • b > 2: black noises • b = 1: pink noise • b = 0: white noise (completely unpredictable) • Colored noises appear even in images (photographs)
2-D color images • Image features for Content Based Image Retrieval (CBIR): • Low Level: • Color – color histograms • Texture – directionality, granularity, contrast • Shape – turning angle, moments of inertia, pattern spectrum • Position – 2D strings method • …etc • Object Level: • Regions
2-D color images – Color histograms • Each color image – a 2-d array of pixels • Each pixel – 3 color components (R,G,B) • h colors – each color denoting a point in 3-d color space (as high as $2^{24}$ colors) • For each image compute the h-element color histogram – each component is the percentage of pixels that are most similar to that color • The histogram of image I is defined as: for a color $C_i$, $H_{C_i}(I)$ represents the number of pixels of color $C_i$ in image I • OR: for any pixel in image I, $H_{C_i}(I)$ represents the probability of that pixel having color $C_i$
2-D color images – Color histograms • Usually cluster similar colors together and choose one representative color for each ‘color bin’ • Most commercial CBIR systems include the color histogram as one of the features (e.g., QBIC of IBM) • No spatial information
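A sketch of computing a normalized color histogram. It assumes 8-bit (R,G,B) pixels and uses uniform quantization of each channel into `levels` intervals, giving $h = levels^3$ bins; this is a simple stand-in for the color clustering mentioned above, and both function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]


def color_bin(p: Pixel, levels: int = 4) -> int:
    """Index of the color bin for an 8-bit (R, G, B) pixel."""
    r, g, b = (c * levels // 256 for c in p)
    return (r * levels + g) * levels + b


def color_histogram(pixels: Sequence[Pixel], levels: int = 4) -> List[float]:
    """h-element histogram: fraction of pixels falling in each color bin."""
    if not pixels:
        raise ValueError("image has no pixels")
    hist = [0.0] * (levels ** 3)
    for p in pixels:
        hist[color_bin(p, levels)] += 1.0
    return [count / len(pixels) for count in hist]
```

The pixel order is irrelevant to the result, which is exactly why the histogram carries no spatial information.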
Color histograms – distance • One method to measure the distance between two histograms x and y is: $d_h^2(x, y) = (x - y)^T A (x - y) = \sum_{i=1}^{h} \sum_{j=1}^{h} a_{ij} (x_i - y_i)(x_j - y_j)$, where the color-to-color similarity matrix A has entries $a_{ij}$ that describe the similarity between color i and color j
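A small sketch of a quadratic-form histogram distance, assuming the usual form $d_h^2(x, y) = (x - y)^T A (x - y)$ with A given as a nested list. The function name and the example matrices are illustrative.

```python
import math
from typing import Sequence


def quadratic_form_distance(x: Sequence[float], y: Sequence[float],
                            A: Sequence[Sequence[float]]) -> float:
    """sqrt((x - y)^T A (x - y)) for histograms x, y and similarity matrix A."""
    d = [xi - yi for xi, yi in zip(x, y)]
    h = len(d)
    # Double sum over all pairs of bins: these are the cross terms.
    total = sum(A[i][j] * d[i] * d[j] for i in range(h) for j in range(h))
    return math.sqrt(max(total, 0.0))  # guard against tiny negative rounding
```

With A equal to the identity matrix this reduces to the Euclidean distance. With off-diagonal entries close to 1 for similar colors, moving mass between those colors changes the distance very little.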
Color histograms – lower bounding • Two obstacles for using color histograms as feature vectors in GEMINI: • ‘Dimensionality curse’ (h is large, e.g., 64 or 128) • The distance function is quadratic • It involves all cross terms (‘cross-talk’ among features) • Expensive to compute • Precludes the use of SAMs • [Figure: two histograms, x and q, over e.g. 64 color bins, with the similar colors bright red, pink, and orange marked]
Color histograms – lower bounding • 1st step: define the distance function between two color images, $D() = d_h()$ • 2nd step: find numerical features (one or more) whose Euclidean distance lower-bounds $d_h()$ • If we are allowed to use one numerical feature to describe the color image, what should it be? • The average amount of each color component (R,G,B): $\bar{x} = (R_{avg}, G_{avg}, B_{avg})^T$ • Where $R_{avg} = \frac{1}{P} \sum_{p=1}^{P} R(p)$, and similarly for G and B • Where P is the number of pixels in the image and R(p) is the red component (intensity) of the p-th pixel
Color histograms – lower bounding • Given the average color vectors $\bar{x}$ and $\bar{y}$ of two images, we define $d_{avg}()$ as the Euclidean distance between the 3-d average color vectors • 3rd step: prove that the feature distance $d_{avg}()$ lower-bounds the actual distance $d_h()$ • Main idea of the approach: • First, a filtering step using the average (R,G,B) color • Then, a more accurate matching using the full h-element histogram
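A sketch of the filtering feature, assuming an image is given as a flat list of (R,G,B) pixels. The function names are illustrative. How the tolerance in average-color space must be scaled relative to the tolerance on $d_h()$ depends on the matrix A and is not given here, so the sketch only computes the feature and its distance.

```python
import math
from typing import Sequence, Tuple

Pixel = Tuple[int, int, int]


def average_color(pixels: Sequence[Pixel]) -> Tuple[float, float, float]:
    """(R_avg, G_avg, B_avg): per-channel mean over all P pixels."""
    if not pixels:
        raise ValueError("image has no pixels")
    P = len(pixels)
    return (sum(p[0] for p in pixels) / P,
            sum(p[1] for p in pixels) / P,
            sum(p[2] for p in pixels) / P)


def d_avg(image_a: Sequence[Pixel], image_b: Sequence[Pixel]) -> float:
    """Euclidean distance between the 3-d average color vectors."""
    return math.dist(average_color(image_a), average_color(image_b))
```

The feature has only three dimensions regardless of h, which sidesteps both the dimensionality problem and the cross terms during the filtering step.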
Color auto-correlogram • Pick any pixel p1 of color $C_i$ in the image I • At distance k away from p1, pick another pixel p2 • What is the probability that p2 is also of color $C_i$? • [Figure: image I with a red pixel P1 and a pixel P2 at distance k from it]
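Color auto-correlogram • The auto-correlogram of image I for color $C_i$ and distance k: $\gamma_{C_i}^{(k)}(I) = \Pr[\, p_2 \in I_{C_i} \mid p_1 \in I_{C_i},\ |p_1 - p_2| = k \,]$, where $I_{C_i}$ denotes the set of pixels of color $C_i$ in I • Integrates both color information and spatial information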
Implementations • Pixel distance measures • Use the D8 distance (also called chessboard distance): $D_8(p, q) = \max(|p_x - q_x|, |p_y - q_y|)$ • Choose distance k = 1, 3, 5, 7 • Computational complexity: • Histogram: • Correlogram:
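A naive sketch of the auto-correlogram under the D8 distance. It assumes the image is a 2-d list of already-quantized color indices and estimates the probability as a ratio of counts over all in-bounds pixel pairs; the function names are hypothetical and no attempt is made at an efficient algorithm.

```python
from typing import List, Sequence, Tuple


def d8(p: Tuple[int, int], q: Tuple[int, int]) -> int:
    """Chessboard distance between pixel coordinates p and q."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


def ring(k: int) -> List[Tuple[int, int]]:
    """All (dy, dx) offsets at D8 distance exactly k from the origin."""
    return [(dy, dx) for dy in range(-k, k + 1) for dx in range(-k, k + 1)
            if max(abs(dy), abs(dx)) == k]


def auto_correlogram(img: Sequence[Sequence[int]], color: int, k: int) -> float:
    """Fraction of pixels at D8 distance k from a `color` pixel that also have `color`."""
    rows, cols = len(img), len(img[0])
    offsets = ring(k)
    same = total = 0
    for y in range(rows):
        for x in range(cols):
            if img[y][x] != color:
                continue
            for dy, dx in offsets:
                ny, nx = y + dy, x + dx
                if 0 <= ny < rows and 0 <= nx < cols:  # skip out-of-image neighbors
                    total += 1
                    same += img[ny][nx] == color
    return same / total if total else 0.0
```

Two images with identical histograms can differ here: a solid block of one color scores near 1 at small k, while the same number of pixels scattered across the image scores much lower.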