Efficient Database Searching Using a Suffix Array and q-Gram Technique (QUASAR)

q-gram Based Database Searching Using a Suffix Array (QUASAR)
S. Burkhardt, A. Crauser, H.-P. Lenhof, E. Rivals, P. Ferragina, M. Vingron
Max-Planck-Institut für Informatik, Saarbrücken
Deutsches Krebsforschungszentrum, Heidelberg

Outline
• Existing Work
• Motivation
• Problem
• Algorithm
• Results

Existing Work
• Examples:
  • BLAST
  • FASTA
• Linear Scan (No Index)
• Good Sensitivity

Motivation
• Today: New Applications
• Examples:
  • EST Clustering
  • Large-Scale Shotgun Assembly
• Low Sensitivity
• Multiple Searches
• Specialized Algorithms Needed

Problem Definition
Pattern P:  TCGATTACAGTGAAT
Database D: GCATTCGATGGACTGGACTAGTGAATCAGT
w = 8
• Local Alignment, minimum Length w
• Low Error Rate (<10% Edit Distance)
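
To make the error criterion concrete, here is a minimal Python sketch of the standard dynamic program for approximate substring matching. It returns the smallest edit distance between a pattern and any substring of the database. The function name `best_substring_distance` is an illustrative assumption, and this brute-force check is not the QUASAR algorithm itself, only a reference for what counts as a match.

```python
def best_substring_distance(pattern, text):
    """Minimum edit distance between pattern and any substring of text."""
    # Row 0: the empty pattern matches anywhere in text at cost 0.
    prev = [0] * (len(text) + 1)
    for i, pc in enumerate(pattern, start=1):
        # Column 0: i pattern characters aligned against nothing cost i.
        curr = [i] + [0] * len(text)
        for j, tc in enumerate(text, start=1):
            curr[j] = min(
                prev[j - 1] + (pc != tc),  # match or substitution
                prev[j] + 1,               # pattern character left unmatched
                curr[j - 1] + 1,           # extra database character
            )
        prev = curr
    # The match may end anywhere in text.
    return min(prev)


D = "GCATTCGATGGACTGGACTAGTGAATCAGT"  # database from the slide

# The pattern's tail occurs exactly in D (positions 19 to 25).
print(best_substring_distance("AGTGAAT", D))  # 0
```

A candidate region would then satisfy the slide's criterion when the distance divided by the alignment length stays below 0.10.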

The Algorithm
• Filter Step:
  • Identify Hotspots
• Scan Step:
  • Scan Hotspots with BLAST

The Algorithm: q-gram Filtration
• Overview: q-gram Filtration, Block Addressing, Suffix Array, Window Shifting
q = 3, w = 8
Pattern P:  TCGATTACAGTGAAT
Window:     TCGATTAC
q-grams of the window: TCG, CGA, GAT, ATT, TTA, TAC
Database D: GCATTCGATGGACTGGACTAGTGAATCAGT
• Number of q-grams: |P| - q + 1
• Edit distance e: at least t = |P| - q + 1 - qe common q-grams
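
The q-gram count and the threshold t can be checked with a few lines of Python. This is a minimal sketch using the slide's example values; the helper names `qgrams` and `threshold` are illustrative assumptions.

```python
def qgrams(s, q):
    """Return all overlapping substrings of length q, in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def threshold(length, q, e):
    """Lower bound on shared q-grams: length - q + 1 - q*e."""
    return length - q + 1 - q * e


window = "TCGATTAC"  # first w = 8 characters of the slide's pattern

print(qgrams(window, 3))                 # ['TCG', 'CGA', 'GAT', 'ATT', 'TTA', 'TAC']
print(threshold(len(window), q=3, e=1))  # 3
```

Each edit operation can destroy at most q of the overlapping q-grams, which is where the term q*e in the bound comes from.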

The Algorithm: Block Addressing
Window:     TCGATTAC
Database D: GCATTCGATGGACTGGACTAGTGAATCAGT
Block counters (from figure): 4 0 1 2 3 0 0 0
• Divide D into Blocks
• Count matching q-grams per Block
• Scan Blocks with counter ≥ t
• How to find the matching q-grams?
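
The counting step can be sketched in Python as follows. This is a simplified illustration, not the QUASAR implementation: it finds q-gram occurrences by plain string search instead of an index, the block size of 10 is an arbitrary choice for the 30-character example, and matches that straddle a block boundary are ignored. The names `find_all` and `block_counters` are assumptions.

```python
def find_all(text, sub):
    """Yield every start position of sub in text, overlaps included."""
    pos = text.find(sub)
    while pos != -1:
        yield pos
        pos = text.find(sub, pos + 1)


def block_counters(database, window, q, block_size):
    """Count, per block, the database hits of the window's q-grams."""
    n_blocks = (len(database) + block_size - 1) // block_size
    counters = [0] * n_blocks
    for i in range(len(window) - q + 1):
        for pos in find_all(database, window[i:i + q]):
            counters[pos // block_size] += 1  # block containing the hit
    return counters


D = "GCATTCGATGGACTGGACTAGTGAATCAGT"  # database from the slides
t = 3                                  # threshold for q = 3, w = 8, e = 1

counters = block_counters(D, "TCGATTAC", q=3, block_size=10)
candidates = [b for b, c in enumerate(counters) if c >= t]

print(counters)    # [4, 0, 0]
print(candidates)  # [0]
```

Only block 0 reaches the threshold, so only that block would be passed on to the scan step.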

The Algorithm: Suffix Array
Window:     TCGATTAC
Database D: GCATTCGATGGACTGGACTAGTGAATCAGT
Positions in D: 0 to 29
q-gram lookup table (values as shown in the figure):
AAA : 0   AAC : 0   AAG : 0   AAT : 0
ACA : 1   ACC : 1   ACG : 1   ACT : 1
AGA : 3   AGC : 3   AGG : 3   AGT : 3
ATA : 4   ATC : 4   ATG : 4   ATT : 5
...
TGA : 26  TGC : 27  TGG : 27  TGT : 29
TTA : 29  TTC : 29  TTG : 30  TTT : 30
Suffix array entries shown in the figure: 3 23 11 16
• Precomputed Searches for q-grams, O(1) Access Time
• Sorted List of Pointers to Suffixes, O(log |D|) Access Time
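
The two structures on this slide can be sketched in Python as below. The suffix array is built naively by sorting suffixes, which is fine for a 30-character example but not for a real database, and a dictionary stands in for the precomputed q-gram table. The names `build_suffix_array`, `build_lookup`, and `occurrences` are assumptions, and the table here stores a half-open range of suffix array slots per q-gram rather than a single start index.

```python
def build_suffix_array(text):
    """Sort all suffix start positions lexicographically (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_lookup(text, sa, q):
    """Map each q-gram to the range [lo, hi) of suffix array slots
    whose suffixes start with that q-gram."""
    lookup = {}
    for slot, pos in enumerate(sa):
        gram = text[pos:pos + q]
        if len(gram) < q:
            continue  # suffix too short to contain a full q-gram
        lo, _ = lookup.get(gram, (slot, slot))
        lookup[gram] = (lo, slot + 1)
    return lookup


def occurrences(gram, sa, lookup):
    """All database positions of gram, found with one table access."""
    lo, hi = lookup.get(gram, (0, 0))
    return sorted(sa[lo:hi])


D = "GCATTCGATGGACTGGACTAGTGAATCAGT"
sa = build_suffix_array(D)
lookup = build_lookup(D, sa, q=3)

print(lookup["AAT"], lookup["ACT"])  # (0, 1) (1, 3)
print(occurrences("ACT", sa, lookup))  # [11, 16]
print(occurrences("TTA", sa, lookup))  # []
```

Because suffixes sharing a prefix are adjacent in sorted order, one table entry per q-gram is enough to recover all of its positions without a binary search.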

The Algorithm: Window Shifting
q = 3, w = 8, e = 1, t = 3
Query:      TCGATTACAGTGAAT
Window:     TCGATTAC
Database D: GCATTCGATGGACTGGACTAGTGAATCAGT
Block counters (from figure): 3 0 2 1 4 0 0 7
• Move Window over Query
• Mark full Blocks for each Window
• Scan Marked Blocks
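
A minimal Python sketch of the window shift, under stated assumptions: a dictionary of q-gram positions stands in for the suffix array lookup, the block size of 10 is arbitrary, and block boundaries are not treated specially. Shifting the window by one position adds the q-gram that enters and subtracts the one that leaves, so the counters never have to be rebuilt. The names `index_qgrams` and `marked_blocks` are illustrative.

```python
from collections import defaultdict


def index_qgrams(database, q):
    """Map each q-gram to its start positions (stand-in for the suffix array)."""
    index = defaultdict(list)
    for i in range(len(database) - q + 1):
        index[database[i:i + q]].append(i)
    return index


def marked_blocks(database, query, q, w, e, block_size):
    """Return the blocks whose counter reaches t for at least one window."""
    t = w - q + 1 - q * e
    index = index_qgrams(database, q)
    counters = [0] * ((len(database) + block_size - 1) // block_size)
    marked = set()

    def update(gram, delta):
        for pos in index.get(gram, ()):
            counters[pos // block_size] += delta

    grams = [query[i:i + q] for i in range(len(query) - q + 1)]
    per_window = w - q + 1  # number of q-grams in one window
    for i, gram in enumerate(grams):
        update(gram, +1)                       # q-gram entering the window
        if i >= per_window:
            update(grams[i - per_window], -1)  # q-gram leaving the window
        if i >= per_window - 1:                # window is complete
            marked.update(b for b, c in enumerate(counters) if c >= t)
    return sorted(marked)


D = "GCATTCGATGGACTGGACTAGTGAATCAGT"
P = "TCGATTACAGTGAAT"

print(marked_blocks(D, P, q=3, w=8, e=1, block_size=10))  # [0, 2]
```

Block 0 is marked by the first window (TCGATTAC) and block 2 by the later windows that cover the query's tail AGTGAAT.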

Results
• Influence of the Block Size
• Sensitivity
• Running Times
• Overhead for loading the Index
Benchmark System: UltraSPARC Processor, 333 MHz, 4 GB RAM

Results: Influence of Block Size

Results: Sensitivity
• 1000 Queries
• BLAST Cutoff E = 0.00001
• Number of identical hitlists:
  • Mouse EST DB: 91.4%
  • Human EST DB: 97.1%
• QUASAR finds many Hits below selected Error Level

Results: Running Times
• Test Parameters:
  • 6% Error
  • w = 50
  • q = 11
  • block size 2048
  • scan with BLAST
  • time averaged for 1000 queries
• ~30 times faster than BLAST

Results: Overhead for Loading the Index
• 1000 queries
• Human EST DB, 280 Mbp
• BLAST Test Run:
  • 5 seconds Load Time
  • 13,270 seconds Search Time
• QUASAR Test Run:
  • 90 seconds Load Time
  • 380 seconds Search Time
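
The effect of the load overhead on the overall speedup can be worked out directly from these figures. The short Python calculation below assumes that the BLAST search time is 13,270 seconds, reading the slide's separator as a thousands separator; that reading is an assumption, chosen because it agrees with the roughly 30-fold speedup reported on the previous slide.

```python
# Times in seconds for 1000 queries, taken from the slide.
blast_load, blast_search = 5, 13_270   # 13,270 s is an assumed reading
quasar_load, quasar_search = 90, 380

search_speedup = blast_search / quasar_search
total_speedup = (blast_load + blast_search) / (quasar_load + quasar_search)

print(f"{search_speedup:.1f}x search only, {total_speedup:.1f}x including load")
# 34.9x search only, 28.2x including load
```

The longer index load time costs QUASAR part of its advantage, but it is paid once per run and is amortized over the 1000 queries.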
