Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases. O. Ozturk and H. Ferhatosmanoglu. IEEE International Symp. on Bioinformatics and Bioengineering (BIBE '03), pp. 359-366. Washington, DC. March 2003.
Overview
• Applications of queries
• Background on queries
• Current problem
• Solutions and our solution
• Comparison experiments and results
• Future work
BMI 731 - Winter'04
Queries in general
• We need a metric distance function
  • To measure the (dis)similarity between objects
• Dynamic programming algorithm
  • O(|string1| * |string2|) time and space
  • i.e. O(n^2), where n is the length of the strings
  • Especially bad for genetic sequence queries, where the sequences are long
2 kinds of queries
• ε-range queries
  • Retrieve all objects whose similarity to the query exceeds a given threshold
2 kinds of queries
• k-nearest neighbor (k-NN) queries
  • Retrieve the k most similar objects
  • No domain knowledge necessary
  • Example: 4-NN
2 kinds of queries
• ε-range queries
  • Require domain knowledge
    • Data distribution & distance definition
  • ε too small: none returned
2 kinds of queries
• ε-range queries
  • ε too large: all returned
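The two query types can be sketched as a plain linear scan. This is only an illustration: the function names, the toy database, and the use of Hamming distance as a stand-in metric are assumptions made for the example, not part of the slides.

```python
import heapq

def hamming(a, b):
    # Stand-in metric for this demo only; assumes equal-length strings.
    return sum(x != y for x, y in zip(a, b))

def range_query(db, query, eps, dist):
    """Return every object whose distance to the query is at most eps."""
    return [obj for obj in db if dist(obj, query) <= eps]

def knn_query(db, query, k, dist):
    """Return the k objects closest to the query (ties keep database order)."""
    return heapq.nsmallest(k, db, key=lambda obj: dist(obj, query))

# Hypothetical toy database.
db = ["ACGT", "ACGA", "TCGA", "TTTT", "AAGT"]
q = "ACGT"
print(range_query(db, q, 0, hamming))  # a tight eps keeps only the exact match
print(range_query(db, q, 4, hamming))  # a loose eps returns the whole database
print(knn_query(db, q, 2, hamming))    # k-NN needs no eps at all
```

The range query is sensitive to the choice of ε, while the k-NN query always returns exactly k objects.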
Measuring similarity
• We need a metric distance function
  • To measure the (dis)similarity between objects
• Edit Distance (ED)
  • Three kinds of operations
    • Insert (I), delete (D), replace (R)
  • ACTTAGC to AATGATAG
    • A C T - - T A G C
    •   R   I I       D     ED = 4
    • A A T G A T A G -
• Dynamic programming algorithm
  • O(mn) time and space
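A minimal sketch of the dynamic programming algorithm for edit distance, using unit costs for insert, delete, and replace. The function name and table layout are choices made for this example.

```python
def edit_distance(s, t):
    """Unit-cost edit distance in O(len(s) * len(t)) time and space."""
    m, n = len(s), len(t)
    # d[i][j] holds the edit distance between s[:i] and t[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters of s[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # insert
                d[i - 1][j - 1] + cost,  # match or replace
            )
    return d[m][n]

print(edit_distance("ACTTAGC", "AATGATAG"))  # 4, as in the slide example
```

The full table is what makes the method costly for long genetic sequences: both time and memory grow with the product of the two lengths.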
DPA (dynamic programming algorithm)
String/Genome Data
• A query asks for the substrings in the database most similar to the given string
• BLAST has ε-range queries
• Naïve search (linear scan)
  • Scalability problems
• How to handle size
  • Use partial information rather than the whole database
  • Approximate (compress) the string data
    • May fit in memory
    • May be used for indexing, clustering
How to Handle Size
• 3 approaches to make use of compressed data
  • Prune irrelevant data, then do I/O and calculate exact values only for the non-pruned entries (especially ε-range queries)
  • Get approximate answers with virtually no I/O (I/O only for the answers) (especially k-NN queries)
  • Approximate pruning for ε-range queries
Overview
• Background on queries
• Current problem
• Transformation and Indexing
• Comparison experiments and results
• Future work
Big Picture: General approach, step by step
• Preprocessing
  • Transform (large) string data into (hopefully smaller) multi-dimensional vectors
  • Develop a distance function df in the vector space to approximate the string similarity
  • Build a multi-dimensional index on top of the multi-dimensional vectors
• Query
  • Implement one of the three approaches mentioned
Preprocessing
1. Windowing: the string database is cut into overlapping windows
2. Transformation into vector space: the windows become multidimensional vectors
3. Indexing: the vectors are indexed with respect to some distance function
Using the index
1. Transformation: the query sequence is transformed and run against the index of vectors
2a. Approximate query (k-NN or ε-range): the vectors returned represent most of the k-NN (or most of the vectors in the ε-range) plus some false positives. Done.
2b. Exact query (k-NN or ε-range): the index of vectors returns a candidate set. Continued.
Using the index (continued)
3. Refine the candidate set
• I/O for the strings represented by those vectors
• Calculate ED for each of them (remove false positives)
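The filter-and-refine flow for an exact ε-range query can be sketched as follows. This is a simplified illustration under stated assumptions: the vectors are computed on the fly instead of being read from an index, the filter is a 1-gram frequency distance (a lower bound on edit distance), and the database is a hypothetical toy list.

```python
from collections import Counter

def fd1(a, b):
    """1-gram frequency distance between two strings (cheap filter)."""
    ca, cb = Counter(a), Counter(b)
    pos = sum(max(ca[c] - cb[c], 0) for c in "ACGT")
    neg = sum(max(cb[c] - ca[c], 0) for c in "ACGT")
    return max(pos, neg)

def edit_distance(s, t):
    """Unit-cost edit distance, keeping only two rows of the DP table."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]

def range_query(db, query, eps):
    # Filter: a string whose lower bound already exceeds eps cannot qualify.
    candidates = [s for s in db if fd1(s, query) <= eps]
    # Refine: compute the exact distance only for the surviving candidates.
    answers = [s for s in candidates if edit_distance(s, query) <= eps]
    return candidates, answers

db = ["AACCGG", "AACCGT", "GGCCAA", "TTTTTT"]  # hypothetical toy data
candidates, answers = range_query(db, "AACCGG", 1)
print(candidates)  # GGCCAA survives the filter as a false positive
print(answers)     # the refinement step removes it
```

Because the filter never overestimates the edit distance, pruning cannot discard a true answer; it can only let false positives through to the refinement step.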
1st Step: Partitioning into overlapping windows
• AACCGGTTACGTACGT…
• e.g. window size W=6
• e.g. shift amount 2
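The windowing step can be sketched in a few lines. The function and parameter names are made up for this example; only the window size 6 and the shift 2 come from the slide.

```python
def overlapping_windows(seq, w, shift):
    """Cut seq into windows of length w whose starts are `shift` apart."""
    return [seq[i:i + w] for i in range(0, len(seq) - w + 1, shift)]

windows = overlapping_windows("AACCGGTTACGTACGT", w=6, shift=2)
print(windows)
# ['AACCGG', 'CCGGTT', 'GGTTAC', 'TTACGT', 'ACGTAC', 'GTACGT']
```

A smaller shift produces more windows and more overlap, so the index grows while fewer alignments are missed.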
2nd Step: Mapping windows into vector space
• Choose a tuple size k
• Associate an integer with each of the 4^k k-tuples
• The frequencies of those k-tuples form the vector
• If k=2, there are 4^k=16 k-tuples
  • AA, AC, AG, AT
  • CA, CC, CG, CT
  • TA, TC, TG, TT
  • GA, GC, GG, GT
Example Mapping
• The integers assigned
  • AA=0, AC=1, AG=2, AT=3
  • CA=4, CC=5, CG=6, CT=7
  • TA=8, TC=9, TG=10, TT=11
  • GA=12, GC=13, GG=14, GT=15
• Assume the window AACCGG
  • AA, AC, CC, CG, GG all occur once
  • 1100011000000010 is the matching vector (ones at positions 0, 1, 5, 6 and 14)
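A short sketch of the 2-tuple mapping, using the integer assignment listed on the slide (first letter in the order A, C, T, G; second letter in the order A, C, G, T). The names TUPLES, INDEX, and fv2 are chosen for this example.

```python
# Tuple order copied from the slide: AA=0 ... AT=3, CA=4 ... CT=7,
# TA=8 ... TT=11, GA=12 ... GT=15.
TUPLES = [a + b for a in "ACTG" for b in "ACGT"]
INDEX = {t: i for i, t in enumerate(TUPLES)}

def fv2(window):
    """Count every overlapping 2-tuple of the window."""
    vec = [0] * len(TUPLES)
    for i in range(len(window) - 1):
        vec[INDEX[window[i:i + 2]]] += 1
    return vec

vec = fv2("AACCGG")
print("".join(map(str, vec)))  # 1100011000000010
```

For the window AACCGG this sets indices 0 (AA), 1 (AC), 5 (CC), 6 (CG) and 14 (GG), since GG is assigned the integer 14.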
Different transformations & distance functions
• Tuple size 1: transformation size 4 (frequencies of A, C, G, T), named FV1
• Tuple size 2: transformation size 16 (frequencies of 2-tuples), named FV2
Different transformations & distance functions 2
WVn transformation
• Split the string into halves x, y
• Compute the FVn of x and of y: FVx, FVy
• Concatenate their sum and their difference: [FVx + FVy, FVx - FVy]
Wavelet 1 on the example TCACTTAG
• 1st: divide into halves & find the FV1 transformation
  • x: TCAC, FV1 = 1 2 0 1
  • y: TTAG, FV1 = 1 0 1 2
• 2nd: add and subtract
  • WV1 = 2 2 1 3 0 2 -1 -1
• The same operations on 2-tuples give WV2
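The WV1 computation on the slide example can be reproduced with a small sketch. The function names are assumptions for this example, and the frequency order A, C, G, T is inferred from the numbers on the slide.

```python
def fv1(s):
    """Frequencies of A, C, G, T in s."""
    return [s.count(c) for c in "ACGT"]

def wv1(s):
    """Sum and difference of the FV1 vectors of the two halves of s."""
    half = len(s) // 2
    x, y = fv1(s[:half]), fv1(s[half:])
    return [a + b for a, b in zip(x, y)] + [a - b for a, b in zip(x, y)]

print(fv1("TCAC"), fv1("TTAG"))  # [1, 2, 0, 1] [1, 0, 1, 2]
print(wv1("TCACTTAG"))           # [2, 2, 1, 3, 0, 2, -1, -1]
```

The first half of the result is the plain FV1 of the whole string; the second half records how the two halves of the string differ, which is positional information that FV1 alone does not keep.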
Distance Functions on the Vector Spaces
• All of them are proven lower bounds on the edit distance
• FD1: distance on FV1
• FD2: distance on FV2
• WD1: distance on WV1
• WD2: distance on WV2
Frequency Distance FDn
Algorithm: FDn (n-gram frequency vectors u, v)
• posDist := negDist := 0
• For all dimensions ui, vi
  • If ui > vi then posDist += ui - vi, else negDist += vi - ui
• Return max(posDist, negDist)/n
Example (n=1)
• u: ACTTAGC, frequencies 2,2,1,2
• v: AATGATAG, frequencies 4,0,2,2
• 2-4<0: negDist += |2-4|
• 2-0>0: posDist += |2-0|
• 1-2<0: negDist += |1-2|
• 2-2=0: no change
• posDist: 2, negDist: 3
• FD1 is 3
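The frequency distance can be written as a short function. This sketch accumulates the positive and negative differences separately, as the worked example does; the function name and the default n=1 are choices made for the example.

```python
def frequency_distance(u, v, n=1):
    """FDn between two n-gram frequency vectors u and v."""
    pos_dist = neg_dist = 0
    for ui, vi in zip(u, v):
        if ui > vi:
            pos_dist += ui - vi  # u has surplus n-grams in this dimension
        else:
            neg_dist += vi - ui  # u is short of n-grams in this dimension
    return max(pos_dist, neg_dist) / n

u = [2, 2, 1, 2]  # ACTTAGC: counts of A, C, G, T
v = [4, 0, 2, 2]  # AATGATAG
print(frequency_distance(u, v))  # 3.0
```

The result 3 stays below the edit distance 4 of the same pair, as a lower bound should.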
FDn: Why a lower bound?
• On the example
  • Need to increase A by 2 and G by 1: 3 in total
  • Need to decrease C by 2
• A single replace can "increase + decrease" at once (back to slide #8)
• So in the best case the edit distance is only FD1
• But this may not be the case: more operations may be needed because of mismatched locations…
• The division by n is there because a change in one character updates the frequencies of n n-grams
Wavelet Distance WDn
Algorithm: WDn (n-gram frequency wavelets u, v)
• Find posDist and negDist on u, v
• m := min(posDist, negDist)
• d := (posDist - negDist)/2
• If m < d, return d/n; else return (d + (m - d)/2)/n
Example (n=1)
• u: ACTC TAGC, half frequencies 1201 and 1111, wavelet 2 3 1 2 0 1 -1 0
• v: AATG ATAG, half frequencies 2011 and 2011, wavelet 4 0 2 2 0 0 0 0
• posDist: 3 + 1 = 4
• negDist: 2 + 1 + 1 = 4
• m: 4, d: 0
• Return (0 + 4/2)/1 = 2
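A sketch of the wavelet distance as stated on the slide. One detail is an assumption: the slide writes d as (posDist - negDist)/2, and this example takes the absolute value so that d is never negative. The function name is also made up for the example.

```python
def wavelet_distance(u, v, n=1):
    """WDn between two n-gram frequency wavelet vectors u and v."""
    pos_dist = sum(a - b for a, b in zip(u, v) if a > b)
    neg_dist = sum(b - a for a, b in zip(u, v) if a < b)
    m = min(pos_dist, neg_dist)
    d = abs(pos_dist - neg_dist) / 2  # abs() is an assumption, see lead-in
    if m < d:
        return d / n
    return (d + (m - d) / 2) / n

u = [2, 3, 1, 2, 0, 1, -1, 0]  # wavelet of ACTC TAGC
v = [4, 0, 2, 2, 0, 0, 0, 0]   # wavelet of AATG ATAG
print(wavelet_distance(u, v))  # 2.0
```

With posDist = negDist = 4 the function takes the second branch and returns (0 + 4/2)/1 = 2, matching the slide.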
WDn: Why a lower bound?
• Assume a string transformed into the wavelet [a1,…a, b1,…b]
• Largest change: posDist += 3, negDist -= 1, or vice versa
• So use this change whenever posDist ≠ negDist
Overview
• Background on queries
• Current problem
• Transformation and Indexing
• Comparison experiments and results
• Future work
Experiment Design
• Implemented the transformations & distance functions
• Evaluated experimentally, on real genetic data, their pruning efficiency on ε-range queries and their approximation efficiency on k-NN queries
• Ran queries with different parameters
  • Varying string size W and shift amount
  • Some containing an exact match, some not
  • For ε-range queries, different ε values
  • For k-NN queries, different k values
Sorted Graphs
• To depict why our distance functions perform so well in k-NN
• Imitate what our k-NN approximation does, and graph the result
• It sorts the data values in increasing order and takes the k nearest ones
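The approximate k-NN step described above (sort by the approximate distance, keep the first k) can be sketched as follows. The choice of FD1 as the approximate distance, the function names, and the toy database are assumptions for this example.

```python
def fd1(a, b):
    """1-gram frequency distance between two strings."""
    diffs = [a.count(c) - b.count(c) for c in "ACGT"]
    pos = sum(x for x in diffs if x > 0)
    neg = sum(-x for x in diffs if x < 0)
    return max(pos, neg)

def approx_knn(db, query, k):
    """Rank by the approximate distance only; no edit distance is computed."""
    return sorted(db, key=lambda s: fd1(s, query))[:k]

db = ["AACCGG", "AACCGT", "TTTTTT", "ACGTAC"]  # hypothetical toy data
result = approx_knn(db, "AACCGG", 2)
print(result)  # ['AACCGG', 'AACCGT']
```

The quality of such an answer depends on how closely the ordering by approximate distance follows the ordering by edit distance, which is what the sorted graphs are meant to show.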
[Two figure slides: sorted-distance graphs with panels "50 nearest" and "20 nearest"]
Nature of the distance functions
• WD2 performs very well in k-NN even though it does not prune as well
• The variance of its ratio to the edit distance is much lower than that of the others, which is what you would like from a distance function
Results
• Tested the parameters obtained from these random experiments on real data
• Then also did the parameter extraction using real data
Comparison of index structures
Future Work
• Check the applicability of these methods to other kinds of sequence data
  • Text
  • Image search
• Implement the index structure in the standalone program and evaluate its performance