
Leila Kaghazian

Efficient Similarity Search in Sequence Databases. Rakesh Agrawal, Christos Faloutsos and Arun Swami. Leila Kaghazian.


Presentation Transcript


  1. Efficient Similarity Search in Sequence Databases • Rakesh Agrawal, Christos Faloutsos and Arun Swami • Leila Kaghazian

  2. Similarity • Exact queries • Similarity queries • Identify companies with similar patterns of growth • Determine products with similar selling patterns • Discover stocks with similar movements in stock prices.

  3. Similarity Queries • Whole Matching. The sequences to be compared have the same length n. • Range Query. Given a query sequence, find the sequences that are similar to it within distance “e”. • All-Pairs Query. Given N sequences, find the pairs of sequences that are within “e” of each other. • Subsequence Matching. The query sequence is shorter; we look for the subsequence of the large sequence that best matches the query sequence.
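The two whole-matching query types above can be stated as a brute-force baseline. This is only a sketch of the definitions, not the indexed method from the talk: it assumes Euclidean distance as the similarity measure, and the names `euclidean`, `range_query` and `all_pairs_query` are illustrative.

```python
import math
from itertools import combinations

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("whole matching requires equal-length sequences")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def range_query(database, query, eps):
    """Indices of the database sequences within distance eps of the query."""
    return [i for i, s in enumerate(database) if euclidean(s, query) <= eps]

def all_pairs_query(database, eps):
    """Index pairs (i, j) with i < j whose sequences are within eps of each other."""
    return [(i, j)
            for (i, s), (j, t) in combinations(enumerate(database), 2)
            if euclidean(s, t) <= eps]

# Toy data (made up for illustration): sequences 0 and 1 are close, 2 is far away.
db = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.5], [9.0, 9.0, 9.0]]
near_query = range_query(db, [1.0, 2.0, 3.0], 0.6)
near_pairs = all_pairs_query(db, 0.6)
```

A sequential scan like this costs one distance computation per sequence for a range query and one per pair for an all-pairs query, which is the cost an index is meant to avoid.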

  4. Extracting Features from Sequences • For numerical sequences, extract k features from each sequence, map the sequence to a point in k-dimensional space, and use a multidimensional index method (R*-tree, R-tree, grid files, …) to store and search these points. • Completeness of feature extraction • Dimensionality “curse”

  5. Discrete Fourier Transform • All periodic waves can be generated by combining sine and cosine waves of different frequencies • The number of frequencies may not be finite • The Fourier transform decomposes a periodic wave into its component frequencies

  6. DFT Concept I

  7. DFT Concept II

  8. DFT Characteristics • Completeness of feature extraction • Dimensionality curse • Parseval's theorem implies that the Euclidean distance between two signals x and y in the time domain is the same as their Euclidean distance in the frequency domain
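The Parseval property above can be written out as follows. This is a sketch under one assumption: the DFT is taken in its orthonormal form with the 1/√n scaling, and the symbols x_t, X_f, n and f_c follow the slides' usage.

```latex
% Orthonormal DFT of a length-n sequence x (assumed 1/sqrt(n) normalisation)
X_f = \frac{1}{\sqrt{n}} \sum_{t=0}^{n-1} x_t \, e^{-j 2\pi f t / n},
\qquad f = 0, 1, \dots, n-1

% Parseval: energy is preserved
\sum_{t=0}^{n-1} |x_t|^2 = \sum_{f=0}^{n-1} |X_f|^2

% The DFT is linear, so distances are preserved as well
\| x - y \|^2 = \sum_{t=0}^{n-1} |x_t - y_t|^2
              = \sum_{f=0}^{n-1} |X_f - Y_f|^2 = \| X - Y \|^2

% Keeping only the first f_c coefficients can only shrink the distance
\sum_{f=0}^{f_c - 1} |X_f - Y_f|^2 \;\le\; \| x - y \|^2
```

The last inequality is why an index built on a few coefficients may return false hits but cannot miss a true answer: any pair within distance e in the time domain is also within e in the truncated feature space.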

  9. Proposed Technique • Obtain the DFT coefficients of each sequence in the database • Build a multidimensional index (F-index) using the first fc (<5) Fourier coefficients. • For a range query, obtain the first fc Fourier coefficients of the query and use them to search the F-index. • For an all-pairs query, perform a spatial join using the F-index; the result is a superset of the answer set. • The actual answer set is obtained in a post-processing step
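The range-query steps above can be sketched as a filter-and-refine loop. This is a minimal illustration, not the authors' implementation: a plain list of feature vectors with a linear scan stands in for the R*-tree F-index, the DFT is computed naively rather than with an FFT, and the names `dft_prefix`, `build_f_index` and `f_index_range_query` are hypothetical.

```python
import cmath
import math

def dft_prefix(x, fc):
    """First fc coefficients of the orthonormal DFT (1/sqrt(n) scaling)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n)) / math.sqrt(n)
        for f in range(fc)
    ]

def dist(a, b):
    """Euclidean distance; works for real sequences and complex feature vectors."""
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))

def build_f_index(database, fc):
    """Stand-in for the F-index: one fc-coefficient feature vector per sequence."""
    return [dft_prefix(s, fc) for s in database]

def f_index_range_query(database, f_index, query, eps, fc):
    """Return (answers, candidates) for a range query with threshold eps."""
    q_feat = dft_prefix(query, fc)
    # Filter step: feature-space distance never exceeds the true distance,
    # so no true answer is dropped. The small slack guards against rounding.
    candidates = [i for i, feat in enumerate(f_index)
                  if dist(feat, q_feat) <= eps + 1e-9]
    # Post-processing step: discard false hits using the full sequences.
    answers = [i for i in candidates if dist(database[i], query) <= eps]
    return answers, candidates

# Toy data (made up for illustration).
db = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.5], [4.0, 3.0, 2.0, 1.0]]
index = build_f_index(db, fc=2)
answers, candidates = f_index_range_query(db, index, [1.0, 2.0, 3.0, 4.0], 0.6, fc=2)
```

An all-pairs query would follow the same pattern, with the filter step replaced by a join over the feature vectors and the same post-processing on each candidate pair.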

  10. Euclidean distance features • Euclidean distance is useful in many cases as is • It can be used with any other similarity measure, as long as that measure can be expressed as the Euclidean distance between feature vectors in some feature space • Euclidean distance is the optimal distance measure for estimation if signals are corrupted by additive Gaussian noise • It is preserved under orthonormal transforms

  11. DFT Characteristics • Preserves the distance • Is easy to compute • Concentrates the energy of the signal in a few coefficients • It is an orthonormal transform • Data-dependent transforms • + better performance • - expensive data reorganization if the data set evolves over time • Data-independent transforms (DFT, DCT, wavelet)

  12. Number of Fourier coefficients • The worst-case signal is white noise, where each value xt is completely independent of its neighbors. • White noise has the same energy at every frequency, so all frequencies are equally important. This is bad for the F-index. • Random walks (brown noise) • Stock movements and exchange rates • Primary and secondary trends correspond to strong, low-frequency signals, while minor trends correspond to weak, high-frequency signals
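The contrast between white noise and random walks can be checked numerically. The sketch below measures the share of a signal's energy that falls in its first k DFT coefficients, averaged over synthetic signals. The generators, the parameter values (n=64, k=3, 20 trials) and the function names are assumptions chosen for illustration, not figures from the talk.

```python
import cmath
import math
import random

def low_freq_energy_fraction(x, k):
    """Fraction of the signal's energy held by its first k DFT coefficients."""
    n = len(x)
    total = sum(v * v for v in x)  # by Parseval, equals the frequency-domain energy
    kept = 0.0
    for f in range(k):
        c = sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        kept += abs(c / math.sqrt(n)) ** 2
    return kept / total

def white_noise(n, rng):
    """Independent Gaussian samples: every value ignores its neighbors."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def random_walk(n, rng):
    """Running sum of Gaussian steps (brown noise)."""
    value, out = 0.0, []
    for _ in range(n):
        value += rng.gauss(0.0, 1.0)
        out.append(value)
    return out

def mean_fraction(make_signal, n=64, k=3, trials=20, seed=0):
    """Average low-frequency energy fraction over several synthetic signals."""
    rng = random.Random(seed)
    return sum(low_freq_energy_fraction(make_signal(n, rng), k)
               for _ in range(trials)) / trials

walk_share = mean_fraction(random_walk)
noise_share = mean_fraction(white_noise)
```

For white noise the expected share is about k/n, since the energy is spread evenly over all n frequencies, while a random walk puts a much larger share into the first few coefficients. That difference is what lets a small fc work for data such as stock prices.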

  13. Performance Experiments • How should the number of Fourier coefficients to retain (cut-off frequency fc) be chosen for the F-index method? • A larger fc • reduces the number of false hits • increases the search time. • How does the search time grow as a function of the number of sequences in the database? • How does the length n of the sequences affect the performance?

  14. Number of Fourier coefficients • [Figure panels: Range Queries; All-Pairs Queries]

  15. Different Sequence Set Size • [Figure panels: All-Pairs Queries; Range Queries]

  16. Varying Sequence Length • [Figure panels: All-Pairs Queries; Range Queries]

  17. Discussion • The minimum execution time for both range and all-pairs queries is achieved with a small number of coefficients fc • Increasing the number of sequences in the database results in higher gains for this method • Increasing the length n of the sequences also results in higher gains for the method

  18. Summary • Use the DFT to extract sequence features • Only the first few coefficients are strong enough to be needed • The DFT is orthonormal • Use an R*-tree for indexing • Use Euclidean distance • Computing the DFT (with the FFT) takes O(n log n)
