## Indexing and Data Mining in Multimedia Databases


**Indexing and Data Mining in Multimedia Databases**
Christos Faloutsos, CMU, www.cs.cmu.edu/~christos

**Outline**
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
• New tools for Data Mining: Fractals
• Conclusions
• Resources

C. Faloutsos

**Problem**
Given a large collection of (multimedia) records, find similar/interesting things, i.e.:
• Allow fast, approximate queries, and
• Find rules/patterns

**Sample queries**
Similarity search:
• Find pairs of branches with similar sales patterns
• Find medical cases similar to Smith's
• Find pairs of sensor series that move in sync

**Sample queries (cont’d)**
Rule discovery:
• Clusters (of patients; of customers; ...)
• Forecasting (total sales for next year?)
• Outliers (e.g., fraud detection)

**Outline**
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
• New tools for Data Mining: Fractals
• Conclusions
• Resources

**Indexing - Multimedia**
Problem:
• given a set of (multimedia) objects,
• find the ones similar to a desired query object (quickly!)

[Figure: three time series, $price vs. day (days 1 to 365)]
Distance function: by expert

**‘GEMINI’ - Pictorially**
[Figure: sequences S1, ..., Sn (value vs. day, days 1 to 365) are mapped to feature points F(S1), ..., F(Sn), with axes such as avg and std, and indexed with off-the-shelf S.A.M.s (Spatial Access Methods)]

**‘GEMINI’**
• fast; ‘correct’ (= no false dismissals)
Used for:
• images (e.g., QBIC) (2x, 10x faster)
• shapes (27x faster)
• video (e.g., InforMedia)
• time sequences ([Rafiei+Mendelzon], ++)

**Remaining issues**
• how to extract features automatically?
• how to merge similarity scores from different media?

**Outline**
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
• Visualization: FastMap
• Relevance feedback: FALCON
• Data Mining / Fractals
• Conclusions

**FastMap ??**
[Figure: labels ‘~100’ and ‘~1’]
**FastMap**
• Multi-dimensional scaling (MDS) can do that, but in O(N^2) time
• We want a linear algorithm: FastMap [SIGMOD95]

**Applications: time sequences**
• given n co-evolving time sequences
• visualize them + find rules [ICDE00]
[Figure: exchange rate vs. time for DEM, JPY, HKD]

**Applications - financial**
• currency exchange rates [ICDE00]
[Figure: plot with labels FRF, DEM, GBP, JPY, HKD, USD(t), USD(t-5)]

**Outline**
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
• Visualization: FastMap
• Relevance feedback: FALCON
• Data Mining / Fractals
• Conclusions

**Merging similarity scores**
• e.g., video: text, color, motion, audio
• weights change with the query!
• solution 1: user specifies weights
• solution 2: user gives examples, and we ‘learn’ what he/she wants: relevance feedback (Rocchio, MARS, MindReader)
• but: how about disjunctive queries?

**DEMO**
demo server

**‘FALCON’**
[Figure: stock charts shaped like ‘Vs’ and ‘inverted Vs’]
• Trader wants only ‘unstable’ stocks
• average: is flat!

**“Single query point” methods**
[Figure: ‘+’ marks for good examples and ‘x’ for the single query point in (avg, std) feature space, for Rocchio, MindReader and MARS]
The averaging effect in action...

**Main idea: FALCON Contours**
[Wu+, vldb2000]
[Figure: ‘+’ marks for good examples; axes feature1 (e.g., avg) and feature2 (e.g., std)]

**A: Aggregate Dissimilarity**
• α: parameter (~ -5, acts as a ‘soft OR’)
[Figure: ‘+’ marks for good examples g1, g2 and a candidate point x]

**FALCON**
• converges quickly (~5 iterations)
• good precision/recall
• is fast (can use off-the-shelf ‘spatial/metric access methods’)

**Conclusions for indexing + visualization**
• GEMINI: fast indexing, exploiting off-the-shelf SAMs
• FastMap: automatic feature extraction in O(N) time
• FALCON: relevance feedback for disjunctive queries
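
The aggregate dissimilarity slide above can be sketched in a few lines. The slide text does not spell the formula out, so this sketch assumes a generalized-mean form, D_G(x) = ((1/k) Σ d(x, g_i)^α)^(1/α) over the k good examples g_1, ..., g_k, with the slide's parameter written as α (`alpha`). The Euclidean distance, the sample points, and the names `aggregate_dissimilarity` and `good` are illustrative assumptions, not the authors' code.

```python
import math

def aggregate_dissimilarity(x, good, alpha=-5.0):
    """Generalized-mean distance from x to the 'good' examples.

    Assumed form: ((1/k) * sum(d(x, g)**alpha)) ** (1/alpha).
    A negative alpha lets the closest good example dominate ('soft OR').
    """
    if alpha == 0:
        raise ValueError("alpha must be nonzero")
    dists = [math.dist(x, g) for g in good]  # Euclidean distance (assumption)
    if alpha < 0 and any(d == 0.0 for d in dists):
        return 0.0  # x coincides with a good example
    mean_pow = sum(d ** alpha for d in dists) / len(dists)
    return mean_pow ** (1.0 / alpha)

# Hypothetical disjunctive query: two far-apart good examples in (avg, std) space.
good = [(0.0, 0.0), (10.0, 10.0)]
near_g1 = aggregate_dissimilarity((0.5, 0.0), good)   # close to the first example
near_g2 = aggregate_dissimilarity((10.0, 9.5), good)  # close to the second example
centroid = aggregate_dissimilarity((5.0, 5.0), good)  # halfway between both

print(near_g1 < centroid, near_g2 < centroid)
```

With a negative `alpha`, a point near either good example scores better than the midpoint between them. With `alpha=1.0` the function is a plain average of distances, and the midpoint wins instead, which is the averaging effect the single-query-point slides warn about.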
**Outline**
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
• New tools for Data Mining: Fractals
• Conclusions
• Resources

**Data mining & fractals – Road map**
• Motivation – problems / case study
• Definition of fractals and power laws
• Solutions to posed problems
• More examples

**Problem #1 - spatial d.m.**
Galaxies (Sloan Digital Sky Survey, w/ B. Nichol)
• ‘spiral’ and ‘elliptical’ galaxies (similarly: stores & households; healthy & ill subjects)
• patterns? (not Gaussian; not uniform)
• attraction/repulsion?
• separability?

**Problem #2: dim. reduction**
• given attributes x1, ..., xn
• possibly non-linearly correlated
• drop the useless ones (Q: why? A: to avoid the ‘dimensionality curse’)
[Figure: scatter plot of mpg vs. engine size]

**Answer:**
• Fractals / self-similarities / power laws

**What is a fractal?**
= self-similar point set, e.g., the Sierpinski triangle: zero area; infinite length!

**Definitions (cont’d)**
• Paradox: infinite perimeter; zero area!
• ‘dimensionality’: between 1 and 2
• actually: log(3)/log(2) = 1.58… (long story)

**Intrinsic (‘fractal’) dimension**
E.g.: #cylinders; miles / gallon
• Q: fractal dimension of a line? A: nn(<= r) ~ r^1, where nn(<= r) is the number of neighbors within distance r
• Q: fd of a plane? A: nn(<= r) ~ r^2
• fd == slope of (log(nn) vs. log(r))

**Sierpinski triangle**
[Figure: log(#pairs within <= r) vs. log(r), a straight line with slope 1.58]
This plot == ‘correlation integral’

**Observations**
self-similarity
• <=> fractals
• <=> scale-free
• <=> power laws (y = x^a, F = C*r^(-2))
[Figure: log(#pairs within <= r) vs. log(r), slope 1.58]

**Road map**
• Motivation – problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More examples
• Conclusions
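
The rule "fd == slope of log(#pairs within r) vs. log(r)" from the slides above can be tried out on synthetic data. The sketch below is a minimal illustration, not the author's implementation: it samples the Sierpinski triangle with the 'chaos game', counts pairs by brute force (quadratic in the number of points), and fits the slope by least squares. The function names, sample size, and radius range are assumptions chosen for the example.

```python
import bisect
import itertools
import math
import random

def sierpinski_points(n, seed=0):
    """Sample n points of the Sierpinski triangle with the 'chaos game'."""
    rng = random.Random(seed)
    corners = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
    x, y = 0.25, 0.25
    points = []
    for step in range(n + 20):  # the first 20 steps are burn-in
        cx, cy = rng.choice(corners)
        x, y = (x + cx) / 2, (y + cy) / 2  # jump halfway to a random corner
        if step >= 20:
            points.append((x, y))
    return points

def correlation_dimension(points, r_min=0.01, r_max=0.1, steps=8):
    """Least-squares slope of log(#pairs within r) vs. log(r)."""
    dists = sorted(math.dist(p, q)
                   for p, q in itertools.combinations(points, 2))
    log_r, log_pairs = [], []
    for i in range(steps):
        r = r_min * (r_max / r_min) ** (i / (steps - 1))  # geometric radii
        pairs = bisect.bisect_right(dists, r)  # number of pairs with distance <= r
        if pairs > 0:
            log_r.append(math.log(r))
            log_pairs.append(math.log(pairs))
    mean_x = sum(log_r) / len(log_r)
    mean_y = sum(log_pairs) / len(log_pairs)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_r, log_pairs))
    den = sum((a - mean_x) ** 2 for a in log_r)
    return num / den

rng = random.Random(1)
line = [(t, 2 * t) for t in (rng.random() for _ in range(800))]
fd_line = correlation_dimension(line)
fd_triangle = correlation_dimension(sierpinski_points(800))
print(f"line: {fd_line:.2f}  Sierpinski triangle: {fd_triangle:.2f}")
```

Up to sampling noise, the estimate should land near 1 for the line and near log(3)/log(2) ≈ 1.58 for the triangle. The brute-force pair count is only practical for small samples.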
**Solution #1: spatial d.m.**
Galaxies (Sloan Digital Sky Survey, w/ B. Nichol; ‘BOPS’ plot [sigmod2000])
• clusters?
• separable?
• attraction/repulsion?
• data ‘scrubbing’: duplicates?

**Solution #1: spatial d.m.**
[w/ Seeger, Traina, Traina, SIGMOD00]
[Figure: log(#pairs within <= r) vs. log(r), with curves for ell-ell, spi-spi and spi-ell pairs]
• 1.8 slope
• plateau!
• repulsion!
• duplicates

**spatial d.m.**
[Figure: radii r1 and r2 marked]
Heuristic on choosing # of clusters

**Problem #2: dim. reduction**

**Solution:**
• drop the attributes that don’t increase the ‘partial f.d.’ (PFD)
• definition: the PFD of attribute set A is the f.d. of the projected cloud of points [w/ Traina, Traina, Wu, SBBD00]

**Problem #2: dim. reduction**
[Figure: example point clouds with a global FD of 1 and their projections, labeled PFD=1, PFD~1 and PFD=0]
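
The attribute-dropping idea can be illustrated with a small sketch. It assumes the fractal dimension is estimated as the slope of log(#pairs within r) vs. log(r), and it uses a simple greedy forward pass with a hypothetical `min_gain` threshold. The slides do not give the exact procedure, so this is a simplified stand-in rather than the method of the cited paper; the dataset and all names are made up for the example.

```python
import bisect
import itertools
import math
import random

def fractal_dimension(points, r_min=0.02, r_max=0.2, steps=8):
    """Least-squares slope of log(#pairs within r) vs. log(r)."""
    dists = sorted(math.dist(p, q)
                   for p, q in itertools.combinations(points, 2))
    xs, ys = [], []
    for i in range(steps):
        r = r_min * (r_max / r_min) ** (i / (steps - 1))
        pairs = bisect.bisect_right(dists, r)  # pairs with distance <= r
        if pairs > 0:
            xs.append(math.log(r))
            ys.append(math.log(pairs))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return num / sum((a - mx) ** 2 for a in xs)

def partial_fd(rows, attrs):
    """PFD: fractal dimension of the cloud projected onto attrs."""
    return fractal_dimension([tuple(row[a] for a in attrs) for row in rows])

def select_attributes(rows, n_attrs, min_gain=0.2):
    """Greedy forward pass (assumption): keep an attribute only if
    adding it raises the PFD by more than min_gain."""
    kept, current = [], 0.0
    for a in range(n_attrs):
        candidate = partial_fd(rows, kept + [a])
        if candidate - current > min_gain:
            kept.append(a)
            current = candidate
    return kept

# Hypothetical data: attribute 1 is a non-linear function of attribute 0,
# attribute 2 is constant, attribute 3 is independent of the rest.
rng = random.Random(0)
rows = []
for _ in range(500):
    x, w = rng.random(), rng.random()
    rows.append((x, x * x, 7.0, w))

selected = select_attributes(rows, 4)
print(selected)
```

In this toy dataset the squared attribute and the constant attribute add almost nothing to the PFD once attribute 0 is kept, while the independent attribute raises it by roughly one. Note that the greedy pass depends on attribute order, and the threshold would need tuning on real data.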