Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

O. Ozturk and H. Ferhatosmanoglu. IEEE International Symp. on Bioinformatics and Bioengineering (BIBE '03), pp. 359-366. Washington, DC. March 2003.

Overview
• Applications of queries
• Background on queries
• Current problem
• Solutions and our solution
• Comparison experiments and results
• Future work
BMI 731 - Winter'04

Queries in general
• We need a metric distance function
• To measure the (dis)similarity between objects
• Dynamic programming algorithm
• O(|string1| * |string2|) time and space
• i.e. O(n^2), where n is the length of the strings
• Especially bad for genetic sequence queries, where the sequences are long

2 kinds of queries
• ε-range queries
• Retrieve all objects within distance ε of the query (i.e., similar to the query beyond a certain degree)

2 kinds of queries
• k-nearest neighbor (k-NN) queries
• Retrieve the k most similar objects
• No domain knowledge necessary
• Ex: 4-NN

2 kinds of queries
• ε-range queries
• Require domain knowledge
• Data distribution & distance definition
• ε too small: none returned

2 kinds of queries
• ε-range queries
• ε too large: all returned

Measuring similarity
• We need a metric distance function
• To measure the (dis)similarity between objects
• Edit Distance (ED)
• Three kinds of operations: insert, delete, replace
• ACTTAGC to AATGATAG
    A C T - - T A G C
      R   I I       D
    A A T G A T A G -
    ED = 4
• Dynamic programming algorithm
• O(mn) time and space
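A minimal sketch of the dynamic programming algorithm for unit-cost edit distance, checked against the slide's example. The function name `edit_distance` is illustrative and not taken from the paper.

```python
def edit_distance(s: str, t: str) -> int:
    """Unit-cost edit distance (insert, delete, replace) in O(mn) time and space."""
    m, n = len(s), len(t)
    # dp[i][j] holds the edit distance between the prefixes s[:i] and t[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all i characters of s[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all j characters of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            replace_cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,                 # delete s[i-1]
                dp[i][j - 1] + 1,                 # insert t[j-1]
                dp[i - 1][j - 1] + replace_cost,  # match or replace
            )
    return dp[m][n]


print(edit_distance("ACTTAGC", "AATGATAG"))  # 4
```

The full table is kept here for clarity; only two rows are needed if just the distance, not the alignment, is required.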

DPA (dynamic programming algorithm)

String/Genome Data
• A query asks for the substrings in the database most similar to the given string
• BLAST has ε-range queries
• Naïve search (linear scan)
• Scalability problems
How to Handle Size
• Partial information rather than the whole database
• Approximate the string data (compress) → may fit in memory → may be used for indexing, clustering

How to Handle Size
• 3 approaches to make use of compressed data
• Prune irrelevant data, I/O for non-pruned entries → calculate exact values for non-pruned entries (especially ε-range queries)
• Get approximate answers, virtually no I/O (I/O only for answers) (especially k-NN queries)
• Approximate pruning for ε-range queries

Overview
• Background on queries
• Current problem
• Transformation and Indexing
• Comparison experiments and results
• Future work

Big Picture: general approach, step by step
• Transform (large) string data into (hopefully smaller) multi-dimensional vectors (preprocessing)
• Develop a distance function df in the vector space to approximate the string similarity (preprocessing)
• Build a multi-dimensional index on top of the multi-dimensional vectors (preprocessing)
• Implement one of the three approaches mentioned (query)

Preprocessing
1. Windowing: string database → overlapping windows
2. Transformation into vector space: windows → multidimensional vectors
3. Indexing: vectors indexed with respect to some distance function

Using the index
1. Transformation: the query sequence is transformed and looked up in the index of vectors
2a. Approximate query (k-NN or ε-range): the vectors returned represent most of the k-NN (or the vectors in the ε-range) plus some false positives. Done.
2b. Exact query (k-NN or ε-range): the index of vectors returns a candidate set. Continued.

Using the index (continued)
3. Refine the candidate set: I/O for the strings represented by those vectors. Calculate ED for each of them. (Remove false positives.)
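The exact-query path can be sketched as a filter-and-refine loop for an ε-range query. This is only an illustration under stated assumptions: a linear scan over character-frequency vectors stands in for the multi-dimensional index, the filter is the larger of the total surplus and total deficit in character counts (a lower bound on edit distance), and all names (`fv1`, `lower_bound`, `range_query`) are hypothetical.

```python
def edit_distance(s: str, t: str) -> int:
    """Unit-cost edit distance, keeping only two rows of the DP table."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        cur = [i]
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def fv1(s: str) -> list:
    """Frequencies of A, C, G, T in s."""
    return [s.count(c) for c in "ACGT"]


def lower_bound(u: list, v: list) -> int:
    """max(total surplus, total deficit) between two frequency vectors."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if a < b)
    return max(pos, neg)


def range_query(query: str, windows: list, eps: int):
    """Return (exact answers, number of candidates that survived the filter)."""
    q_vec = fv1(query)
    # Filter: prune windows whose lower bound already exceeds eps.
    # (Assumption: a linear scan replaces the index lookup.)
    candidates = [w for w in windows if lower_bound(q_vec, fv1(w)) <= eps]
    # Refine: compute the exact edit distance only for the candidates.
    answers = [w for w in candidates if edit_distance(query, w) <= eps]
    return answers, len(candidates)


windows = ["AACCGG", "CCGGTT", "GGTTAC", "TTACGT", "ACGTAC", "GTACGT"]
answers, n_candidates = range_query("AACCGT", windows, eps=1)
print(answers, n_candidates)
```

Because the filter never overestimates the edit distance, pruning cannot discard a true answer; it can only let false positives through, which the refinement step removes.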

1st Step: Partitioning into overlapping windows
• AACCGGTTACGTACGT…
• AACCGGTTACGTACGT…
• AACCGGTTACGTACGT…
• e.g. W = 6
• e.g. shift amount = 2
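A small sketch of the windowing step, assuming windows of length W that start every `shift` positions and that a trailing partial window is dropped (the slide does not say how the tail is handled). The name `overlapping_windows` is illustrative.

```python
def overlapping_windows(seq: str, w: int, shift: int) -> list:
    """Split seq into length-w windows whose start positions are shift apart."""
    # Assumption: positions that cannot hold a full window are skipped.
    return [seq[i:i + w] for i in range(0, len(seq) - w + 1, shift)]


for window in overlapping_windows("AACCGGTTACGTACGT", w=6, shift=2):
    print(window)
```

With W = 6 and a shift of 2, consecutive windows share four characters, so the first three windows are AACCGG, CCGGTT and GGTTAC.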

2nd Step: Mapping windows into vector space
• Choose a tuple size k
• Associate an integer with each of the 4^k k-tuples
• The frequencies of those k-tuples form the vector
• If k=2 → 4^k = 16 k-tuples
• AA, AC, AG, AT,
• CA, CC, CG, CT
• TA, TC, TG, TT
• GA, GC, GG, GT

Example Mapping
• The integers assigned
• AA=0, AC=1, AG=2, AT=3,
• CA=4, CC=5, CG=6, CT=7
• TA=8, TC=9, TG=10, TT=11
• GA=12, GC=13, GG=14, GT=15
• Assume window AACCGG
• AA, AC, CC, CG, GG all occur once
• 1100011000000010 is the matching vector (ones at positions 0, 1, 5, 6 and 14).
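The mapping can be sketched for k = 2 as follows, using the integer assignment exactly as listed on the slide. The names `TUPLES_2` and `frequency_vector_2` are illustrative, and skipping tuples outside the A/C/G/T alphabet is an assumption.

```python
# 2-tuples in the order listed on the slide: AA=0, AC=1, ..., GT=15.
TUPLES_2 = [
    "AA", "AC", "AG", "AT",
    "CA", "CC", "CG", "CT",
    "TA", "TC", "TG", "TT",
    "GA", "GC", "GG", "GT",
]
INDEX_2 = {t: i for i, t in enumerate(TUPLES_2)}


def frequency_vector_2(window: str) -> list:
    """Count every overlapping 2-tuple of window into a 16-dimensional vector."""
    vec = [0] * len(TUPLES_2)
    for i in range(len(window) - 1):
        pair = window[i:i + 2]
        if pair in INDEX_2:  # assumption: unknown symbols are ignored
            vec[INDEX_2[pair]] += 1
    return vec


vec = frequency_vector_2("AACCGG")
print("".join(str(x) for x in vec))
```

The order of the dimensions does not affect distances computed between two such vectors, as long as the same assignment is used for every window.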

Different transformations & Distance Functions
• Tuple size → transformation size
• 1 → 4 (frequencies of A, C, G, T): FV1
• 2 → 16 (frequencies of 2-tuples): FV2

Different transformations & Distance Functions 2
WVn transformation
• Divide the string into halves x, y
• Compute the FVn of x and y: FVx, FVy
• Concatenate their sum and their difference: [FVx + FVy, FVx - FVy]
Wavelet 1 on example TCACTTAG
• 1st: divide into halves & find the FV1 transformation
    x: TCAC → 1 2 0 1
    y: TTAG → 1 0 1 2
• 2nd: add and subtract
    2 2 1 3 0 2 -1 -1 (WV1)
• The same operations on 2-tuples give WV2
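A sketch of the WV1 transformation on the slide's example, assuming the A, C, G, T dimension order used in that example and an even-length string (how an odd length would be split is not stated, so the integer division below is an assumption). The names `fv1` and `wv1` are illustrative.

```python
def fv1(s: str) -> list:
    """Frequencies of A, C, G, T in s."""
    return [s.count(c) for c in "ACGT"]


def wv1(s: str) -> list:
    """Concatenate the sum and the difference of the two halves' FV1 vectors."""
    half = len(s) // 2  # assumption: for odd lengths the second half is longer
    x, y = fv1(s[:half]), fv1(s[half:])
    sums = [a + b for a, b in zip(x, y)]
    diffs = [a - b for a, b in zip(x, y)]
    return sums + diffs


print(wv1("TCACTTAG"))  # [2, 2, 1, 3, 0, 2, -1, -1]
```

The first four entries equal the FV1 of the whole string, while the last four record how the counts are distributed between the two halves.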

Distance Functions on the Vector Spaces
• All of them are proved to be lower bounds on the edit distance
• FD1 → distance on FV1
• FD2 → distance on FV2
• WD1 → distance on WV1
• WD2 → distance on WV2

Frequency Distance FDn
Algorithm: FDn (n-gram frequency vectors u, v)
    posDist := negDist := 0
    for all dimensions i:
        if ui > vi then posDist += ui - vi
        else negDist += vi - ui
    return max(posDist, negDist) / n
Example (n=1)
    u: ACTTAGC → 2,2,1,2
    v: AATGATAG → 4,0,2,2
    2-4<0: negDist += |2-4|
    2-0>0: posDist += |2-0|
    1-2<0: negDist += |1-2|
    2-2=0: no change
    posDist: 2, negDist: 3
    FD1 is 3
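The frequency distance can be sketched directly from the algorithm on this slide, accumulating the surpluses and deficits over all dimensions. The names `fv1` and `frequency_distance` are illustrative, and the A, C, G, T dimension order follows the slide's example.

```python
def fv1(s: str) -> list:
    """Frequencies of A, C, G, T in s."""
    return [s.count(c) for c in "ACGT"]


def frequency_distance(u: list, v: list, n: int = 1) -> float:
    """FDn between two n-gram frequency vectors of equal dimension."""
    pos_dist = 0  # total amount by which u exceeds v
    neg_dist = 0  # total amount by which v exceeds u
    for ui, vi in zip(u, v):
        if ui > vi:
            pos_dist += ui - vi
        else:
            neg_dist += vi - ui
    return max(pos_dist, neg_dist) / n


u, v = fv1("ACTTAGC"), fv1("AATGATAG")
print(u, v, frequency_distance(u, v))  # [2, 2, 1, 2] [4, 0, 2, 2] 3.0
```

For this pair the edit distance is 4, so FD1 = 3 is a valid, though not tight, lower bound.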

FDn Why lower bound?
• On the example
• We need to increase A by 2 and G by 1 → 3
• We need to decrease C by 2
• We may "increase + decrease" in one operation if we can replace (see the edit distance example on slide #8)
• So in the best case the edit distance is only FD1
• But that may not be the case: you may need more operations because the locations do not match
• We divide by n because changing one character updates the frequencies of n n-grams.

Wavelet Distance WDn
Algorithm: WDn (n-gram frequency wavelets u, v)
    Find posDist and negDist on u, v
    m := min(posDist, negDist)
    d := (posDist - negDist)/2
    if m < d return d / n
    else return (d + (m - d)/2) / n
Example (n=1)
    u: ACTC TAGC → 1201, 1111 → 2 3 1 2 0 1 -1 0
    v: AATG ATAG → 2011, 2011 → 4 0 2 2 0 0 0 0
    posDist: 3 + 1 = 4
    negDist: 2 + 1 + 1 = 4
    m: 4, d: 0
    Return (0 + 4/2)/1 = 2
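A sketch that follows the WDn steps as written on this slide and reproduces its example. Taking the absolute value of posDist - negDist is an assumption made here so that d is non-negative regardless of argument order; the slide writes the difference without it. The name `wavelet_distance` is illustrative.

```python
def wavelet_distance(u: list, v: list, n: int = 1) -> float:
    """WDn between two n-gram frequency wavelet vectors, per the slide's steps."""
    pos_dist = sum(a - b for a, b in zip(u, v) if a > b)
    neg_dist = sum(b - a for a, b in zip(u, v) if a < b)
    m = min(pos_dist, neg_dist)
    # Assumption: absolute value, so the result does not depend on argument order.
    d = abs(pos_dist - neg_dist) / 2
    if m < d:
        return d / n
    return (d + (m - d) / 2) / n


u = [2, 3, 1, 2, 0, 1, -1, 0]  # WV1 of ACTCTAGC
v = [4, 0, 2, 2, 0, 0, 0, 0]   # WV1 of AATGATAG
print(wavelet_distance(u, v))  # 2.0
```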

WDn Why lower bound?
• Assume a string transformed into wavelet [a1,…a, b1,…b]
• Largest change: posDist+=3, negDist-=1, or vice versa
• So use this change whenever posDist<>negDist

Experiment Design
• Implemented the transformations & distance functions
• Evaluated experimentally, on real genetic data, their pruning efficiency on ε-range queries and their approximation efficiency on k-NN queries
• Ran queries with different parameters
• Varying string size W and shift amount
• Some containing an exact match, some not
• For ε-range queries, different ε values
• For k-NN queries, different k values

Sorted Graphs
• To show why our distance functions perform so well in k-NN
• Imitate what our k-NN approximation does, and graph the result
• It sorts the distance values in increasing order and takes the k nearest ones

[Figure: sorted graphs, panels "50 nearest" and "20 nearest"]

Nature of the distance functions
• WD2 performs very well in k-NN even though it does not prune as well
• The variance of its ratio to the edit distance is much lower than that of the others, as you would like for a distance function

Results
• Tested the parameters obtained from these random experiments on real data.
• Then also did the parameter extraction using real data.

Comparison of index structures

Future Work
• Check the applicability of these methods to other kinds of sequence data
• Text
• Image search
• Implement the index structure in the standalone program and evaluate its performance
