Leila Kaghazian

Efficient Similarity Search in Sequence Databases Rakesh Agrawal, Christos Faloutsos and Arun Swami Leila Kaghazian

Similarity • Exact queries • Similarity • Identify companies with similar pattern of growth • Determine products with similar selling pattern • Discover stocks with similar movement in stock prices.

Similarity Queries • Whole Matching. The sequences to be compared have the same length n. • Range Query. Given a query sequences that are similar within distance “e”. • All-Pairs queries. Given N sequences, find the pairs of sequences that are within “e” of each other. • Subsequence Matching. The query sequence is smaller; we look for a subsequence in the large sequence that best matches the query sequence.

Extracting Features from Sequences • For numerical sequences, extracting K features, mapping it to k-dimensional space and using multidimensional index methods (R*-tree, R-tree,grid-files,…) to store and search these points. • Completeness of feature extracting • Dimensionality “curse”

Discrete Fourier Transform • All periodic waves can be generated by combining Sin and Cos waves of different frequencies • Number of Frequencies may not be finite • Fourier Transform Decomposes a Periodic Wave into its Component Frequencies

DFT Concept I

DFT Concept II

DFT Characteristics • Completeness of feature extracting • Dimensionality curse • Parseval theorem gives that Euclidean distance between two signals x and y in the time domain is the same as their Euclidean distance in the frequency domain

Proposed Technique • Obtain the coefficients of DFT of each sequence in the database • Build a multidimensional index (F-index) using the first fc (<5)Fourier coefficients. • For a range query, obtain the first fc Fourier coefficients of the query. • For an all-pairs query, doing a spatial join using the F-index (superset of the answer set) • The actual answer set is obtained in a post-processing step

Euclidean distance features • Euclidean distance is useful in many cases • It can be used with any other type of similarity measure • Euclidean distance is the optimal distance measure of estimation if signals are corrupted by Gaussian additive noise • It is preserved under orthonormal transforms

DFT Characteristics • Preserves the distance • Is easy to compute • Concentrate the energy of the signal in few coefficients • It’s a orthonormal transform • The data dependent ones • + better performance • - expensive data reorganization if data set evolves over time • Data independent ones(DFT, DCT, wavelet)

Number of Fourier coefficients • Worst-case signal is White noise when xt is completely independent of its neighbors. • It has the same energy in every frequency means all frequency are equally important. This is bad for F-index. • Random walks (brown noise) • Stock movements and exchange rates • Primary and secondary trends correspond to strong, low frequency signals while minor trends corresponds to weak, high frequency signals

Performance Experiments • How to choose the number of Fourier coefficients to be retained (cut-off frequency fc) in the F-index method. • A larger fc • reduces the false hits • increases the search time. • How does the search time grow as a function of number of sequences in the database? • How does the length n of the sequences affect the performance?

Number of Fourier coefficients Range Queries All-Pairs Queries

Different Sequence Set Size All-Pairs Queries Range Queries

Varying Sequence Length All-Pairs Queries Range Queries

Discussion • The minimum execution time for both range and all-pairs queries is achieved for a small number of fc • Increasing the number of sequences in the database results in higher gains for this method • Increasing the length of the sequence n also results in higher gain for the method

Summary • Use DFT to extract sequence features • Only first few coefficient is strong enough • DFT is orthonotmal • Use R*-tree for indexing • Use Euclidean distance • Complexity is O(nlog(n))

Leila Kaghazian

Leila Kaghazian

Presentation Transcript

Ecole Leila North Community School

Leila Othman

Leila Aboulela

Caio Neves Leila Soriano

Santa and the Reindeer BY ; Ariel,Cate,Leila

Miriam Walker Leila Takayama Professor James Landay

Leila McKinnon Management

Welcome to Leila G. Davis Elementary School!

Christopher Keane Leila Gonzales 1 September 2010

Best TOWING in Leila North