q-gram Based Database Searching Using A Suffix Array (QUASAR)

S. Burkhardt, A. Crauser, H-P. Lenhof, E. Rivals, P. Ferragina, M. Vingron. Max-Planck Institut f. Informatik, Saarbrücken; Deutsches Krebsforschungszentrum, Heidelberg.


Presentation Transcript


  1. q-gram Based Database Searching Using A Suffix Array (QUASAR) • S. Burkhardt, A. Crauser, H-P. Lenhof, E. Rivals, P. Ferragina, M. Vingron • Max-Planck Institut f. Informatik, Saarbrücken • Deutsches Krebsforschungszentrum, Heidelberg

  2. Outline • Existing Work • Motivation • Problem • Algorithm • Results

  3. Existing Work • Examples: • BLAST • FASTA • Linear Scan (No Index) • Good Sensitivity

  4. Motivation • Today: New Applications • Examples: • EST-Clustering • Large Scale Shotgun Assembly • Low Sensitivity • Multiple Searches • Specialized Algorithms Needed

  5. Problem Definition • Pattern P: TCGATTACAGTGAAT • Database D: GCATTCGATGGACTGGACTAGTGAATCAGT • w = 8 • Local Alignment, minimum Length w • Low Error Rate (<10% Edit Distance)

  6. The Algorithm • Filter Step: • Identify Hotspots • Scan Step: • Scan Hotspots with BLAST

  7. The Algorithm • q-gram Filtration • Block Addressing • Suffix Array • Window Shifting • Example (q = 3, w = 8): Pattern TCGATTACAGTGAAT, Window TCGATTAC with q-grams TCG, CGA, GAT, ATT, TTA, TAC • Database: GCATTCGATGGACTGGACTAGTGAATCAGT • # of q-grams: |P| - q + 1 • Edit Distance e: at least t = |P| - q + 1 - qe common q-grams
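The threshold on this slide can be checked with a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names `qgrams` and `threshold` are made up for illustration, and the inputs are the slide's example window with q = 3 and an assumed e = 1.

```python
def qgrams(s, q):
    """Return all overlapping substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def threshold(length, q, e):
    """Bound from the slide: t = |P| - q + 1 - q*e common q-grams."""
    return length - q + 1 - q * e


window = "TCGATTAC"                 # the 8-character window from the slide
grams = qgrams(window, 3)           # |P| - q + 1 = 6 q-grams
t = threshold(len(window), 3, 1)    # e = 1 is an assumed value for this example

print(grams)
print(t)
```

Each edit operation can destroy at most q of the overlapping q-grams, which is why the bound subtracts q times e from the total.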

  8. The Algorithm • q-gram Filtration • Block Addressing • Suffix Array • Window Shifting • Divide D into Blocks • Count matching q-grams per Block • Scan Blocks with counter ≥ t • How to find the matching q-grams? • Figure: Window TCGATTAC, Database GCATTCGATGGACTGGACTAGTGAATCAGT, values 4 0 1 2 3 0 0 0
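Block addressing as listed above can be sketched as follows. The block size of 10 and the function name `block_counts` are assumptions for illustration, and the q-gram matches are found here by a plain linear scan rather than through an index, so the sketch only shows the counting step.

```python
def block_counts(db, query, q, block_size):
    """Count, per database block, the positions whose q-gram occurs in the query."""
    query_grams = {query[i:i + q] for i in range(len(query) - q + 1)}
    n_blocks = (len(db) + block_size - 1) // block_size
    counts = [0] * n_blocks
    for pos in range(len(db) - q + 1):
        if db[pos:pos + q] in query_grams:
            # A hit is credited to the block that contains its start position.
            counts[pos // block_size] += 1
    return counts


D = "GCATTCGATGGACTGGACTAGTGAATCAGT"   # database from the slides
window = "TCGATTAC"                    # query window from the slides
t = 3                                  # threshold for q = 3, w = 8, e = 1

counts = block_counts(D, window, q=3, block_size=10)
candidates = [b for b, c in enumerate(counts) if c >= t]
print(counts, candidates)
```

Only the blocks in `candidates` would be handed to the scan step. Hits that straddle a block boundary get no special treatment in this sketch.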

  9. The Algorithm • q-gram Filtration • Block Addressing • Suffix Array • Window Shifting • Precompute Searches for q-grams, O(1) Time Access • Sorted List of Pointers to Suffixes, O(log |D|) Access Time • Lookup table shown: AAA: 0, AAC: 0, AAG: 0, AAT: 0, ACA: 1, ACC: 1, ACG: 1, ACT: 1, AGA: 3, AGC: 3, AGG: 3, AGT: 3, ATA: 4, ATC: 4, ATG: 4, ATT: 5; TGA: 26, TGC: 27, TGG: 27, TGT: 29, TTA: 29, TTC: 29, TTG: 30, TTT: 30 • Suffix pointers shown: 3, 23, 11, 16 • Window: TCGATTAC • Database (positions 0 to 29): GCATTCGATGGACTGGACTAGTGAATCAGT
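The two access paths named on this slide, a precomputed q-gram table and a binary search over sorted suffix pointers, can be sketched as below. This is an illustrative sketch under assumptions: the suffix array is built by naive sorting, the table is a Python dict rather than an array indexed by q-gram code, and all function names are hypothetical.

```python
def build_suffix_array(db):
    """Start positions of all suffixes, sorted lexicographically (naive build)."""
    return sorted(range(len(db)), key=lambda i: db[i:])


def build_lookup(db, sa, q):
    """Map each q-gram to its half-open range [start, end) in the suffix array."""
    table = {}
    for rank, pos in enumerate(sa):
        gram = db[pos:pos + q]
        if len(gram) == q:  # skip suffixes shorter than q
            start, _ = table.get(gram, (rank, rank))
            table[gram] = (start, rank + 1)
    return table


def occurrences(sa, table, gram):
    """Constant-time range lookup through the precomputed table."""
    start, end = table.get(gram, (0, 0))
    return sorted(sa[start:end])


def search(db, sa, gram):
    """Same answer without the table, by binary search over the suffix array."""
    q = len(gram)
    lo, hi = 0, len(sa)
    while lo < hi:  # find the first suffix whose q-prefix is >= gram
        mid = (lo + hi) // 2
        if db[sa[mid]:sa[mid] + q] < gram:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # find the end of the run of suffixes starting with gram
        mid = (lo + hi) // 2
        if db[sa[mid]:sa[mid] + q] == gram:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


D = "GCATTCGATGGACTGGACTAGTGAATCAGT"
sa = build_suffix_array(D)
table = build_lookup(D, sa, 3)
print(occurrences(sa, table, "ACT"), search(D, sa, "ACT"))
```

Suffixes that share a q-gram prefix sit next to each other in the sorted order, so one start index per q-gram is enough to delimit all of its occurrences.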

  10. The Algorithm • q-gram Filtration • Block Addressing • Suffix Array • Window Shifting • Example: q = 3, w = 8, e = 1, t = 3 • Query: TCGATTACAGTGAAT, Window: TCGATTAC • Move Window over Query • Mark full Blocks for each Window • Scan Marked Blocks • Figure values: 3 0 2 1 4 0 0 7 • Database: GCATTCGATGGACTGGACTAGTGAATCAGT
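Window shifting can be sketched by keeping one counter per block and updating it as the window moves one q-gram at a time. The sketch below is an assumption-laden illustration, not QUASAR's code: the name `mark_blocks` and the block size of 10 are hypothetical, and q-gram occurrences come from a dictionary built by a linear pass instead of the suffix array.

```python
from collections import defaultdict


def mark_blocks(db, query, q, w, e, block_size):
    """Return blocks whose counter reaches t for at least one query window."""
    t = w - q + 1 - q * e
    # Block numbers of every occurrence of each q-gram in the database.
    occ = defaultdict(list)
    for pos in range(len(db) - q + 1):
        occ[db[pos:pos + q]].append(pos // block_size)

    grams = [query[i:i + q] for i in range(len(query) - q + 1)]
    per_window = w - q + 1          # number of q-grams inside one window
    counts = defaultdict(int)
    marked = set()
    for j, gram in enumerate(grams):
        for b in occ.get(gram, ()):                      # q-gram entering the window
            counts[b] += 1
        if j >= per_window:
            for b in occ.get(grams[j - per_window], ()):  # q-gram leaving the window
                counts[b] -= 1
        if j >= per_window - 1:                          # window is complete
            marked.update(b for b, c in counts.items() if c >= t)
    return sorted(marked)


D = "GCATTCGATGGACTGGACTAGTGAATCAGT"
P = "TCGATTACAGTGAAT"
marked = mark_blocks(D, P, q=3, w=8, e=1, block_size=10)
print(marked)
```

Updating the counters incrementally avoids recounting every q-gram for each window position. Matches that cross a block boundary are not handled specially here.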

  11. Results • Influence of the Block Size • Sensitivity • Running Times • Overhead for loading the Index • Benchmark System: UltraSPARC Processor, 333 MHz, 4 GB RAM

  12. Results Influence of Block Size

  13. Results Sensitivity • 1000 Queries • BLAST Cutoff E = 0.00001 • Number of identical hitlists • Mouse EST DB: 91.4 % • Human EST DB: 97.1 % • QUASAR finds many Hits below selected Error Level

  14. Results Running Times • Test Parameters: • 6% Error • w = 50 • q = 11 • block size 2048 • scan with BLAST • time averaged for 1000 queries • ~30 times faster than BLAST

  15. Results Overhead for Loading the Index • 1000 queries • Human EST DB, 280 Mbps • BLAST Test Run: • 5 seconds Load Time • 13,270 seconds Search Time • QUASAR Test Run: • 90 seconds Load Time • 380 seconds Search Time
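The figures on this slide can be put side by side with a little arithmetic. This sketch assumes the BLAST search time is 13,270 seconds (reading the slide's decimal point as a thousands separator); the variable names are illustrative only.

```python
# Times in seconds for 1000 queries, taken from the slide.
blast = {"load": 5, "search": 13270}    # 13270 is an assumed reading of the slide
quasar = {"load": 90, "search": 380}

search_speedup = blast["search"] / quasar["search"]
total_speedup = sum(blast.values()) / sum(quasar.values())
load_share = quasar["load"] / sum(quasar.values())

print(f"search only: {search_speedup:.1f}x")
print(f"including load: {total_speedup:.1f}x")
print(f"QUASAR load share: {load_share:.0%}")
```

Under that reading, the search-only ratio is roughly 35 and the ratio including index loading is roughly 28, which is in line with the "~30 times faster" figure on the previous slide.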
