
Efficient Database Searching Using a Suffix Array and q-Gram Technique (QUASAR)

This presentation describes QUASAR, an algorithm for fast database searching that combines q-gram filtration with a suffix array. Tools such as BLAST and FASTA scan the whole database linearly and offer good sensitivity, but applications such as EST clustering and large-scale shotgun assembly run many searches and call for a specialized algorithm. QUASAR precomputes the suffix array searches for all q-grams so that each can be looked up in constant time, uses the matches to identify hotspot blocks of the database, and scans only those blocks with BLAST. In the reported benchmarks it is about 30 times faster than BLAST and returns identical hit lists for 91.4% (mouse EST database) and 97.1% (human EST database) of 1000 queries.


Presentation Transcript


  1. q-gram Based Database Searching Using A Suffix Array (QUASAR) • S. Burkhardt, A. Crauser, H-P. Lenhof, E. Rivals, P. Ferragina, M. Vingron • Max-Planck-Institut für Informatik, Saarbrücken • Deutsches Krebsforschungszentrum, Heidelberg

  2. Outline • Existing Work • Motivation • Problem • Algorithm • Results

  3. Existing Work • Examples: BLAST, FASTA • Linear Scan (No Index) • Good Sensitivity

  4. Motivation • Today: New Applications • Examples: • EST-Clustering • Large Scale Shotgun Assembly • Low Sensitivity • Multiple Searches • Specialized Algorithms Needed

  5. Problem Definition • Pattern P: TCGATTACAGTGAAT • Database D: GCATTCGATGGACTGGACTAGTGAATCAGT • w = 8 • Local Alignment, minimum Length w • Low Error Rate (<10% Edit Distance)
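The error criterion on this slide is stated in terms of edit distance. As a minimal sketch of what that measure means, the following computes the unit-cost edit distance with the standard dynamic-programming recurrence. The function names and the choice to normalize by the longer string are assumptions made for illustration; the slides do not say how the error rate is normalized.

```python
def edit_distance(a, b):
    """Unit-cost edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def error_rate(a, b):
    # Assumption: normalize by the longer of the two strings.
    return edit_distance(a, b) / max(len(a), len(b))


print(edit_distance("TCGATTAC", "TCGAATAC"))
print(error_rate("TCGATTAC", "TCGAATAC"))
```

A single substitution in a window of length 8 already gives an error rate above 10%, which is why the toy examples on the slides use e = 1 only for illustration.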

  6. The Algorithm • Filter Step: • Identify Hotspots • Scan Step: • Scan Hotspots with BLAST

  7. The Algorithm • q-gram Filtration • Block Addressing • Suffix Array • Window Shifting • Example (q = 3, w = 8): Pattern TCGATTACAGTGAAT, Window TCGATTAC, q-grams TCG, CGA, GAT, ATT, TTA, TAC • Database GCATTCGATGGACTGGACTAGTGAATCAGT • # of q-grams: |P| - q + 1 • Edit Distance e: at least t = |P| - q + 1 - qe common q-grams
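The q-gram count and the threshold t from this slide can be checked with a few lines of code. This is an illustrative sketch only; the helper names `qgrams` and `qgram_threshold` are hypothetical and not taken from the presentation.

```python
def qgrams(s, q):
    """Return all overlapping substrings of length q, in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def qgram_threshold(length, q, e):
    """Minimum number of shared q-grams for a string of the given
    length matched with at most e edit operations."""
    return length - q + 1 - q * e


window = "TCGATTAC"
grams = qgrams(window, 3)
t = qgram_threshold(len(window), 3, 1)
print(grams)
print(t)
```

Each edit operation can destroy at most q overlapping q-grams, which is where the q·e term comes from. For the window of length 8 with q = 3 and e = 1 this gives 6 q-grams and t = 3, the values used later in the window-shifting example.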

  8. The Algorithm • q-gram Filtration • Block Addressing • Suffix Array • Window Shifting • Divide D into Blocks • Count matching q-grams per Block • Scan Blocks with counter ≥ t • How to find the matching q-grams? • (Figure: window TCGATTAC over database GCATTCGATGGACTGGACTAGTGAATCAGT; block numbers and counters as extracted: 4 0 1 2 3 0 0 0)
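The block counting step can be sketched as follows. This is a simplified illustration under stated assumptions: q-gram occurrences are found by a naive scan of the database rather than through an index, blocks do not overlap, and a hit is assigned to the block containing its start position. The function name `block_counts` and the block size of 8 are hypothetical choices for the toy example.

```python
from collections import Counter


def block_counts(database, window, q, block_size):
    """Count, per database block, the positions whose q-gram also
    occurs in the query window."""
    window_grams = {window[i:i + q] for i in range(len(window) - q + 1)}
    counts = Counter()
    for pos in range(len(database) - q + 1):
        if database[pos:pos + q] in window_grams:
            counts[pos // block_size] += 1  # assign hit to its start block
    return counts


database = "GCATTCGATGGACTGGACTAGTGAATCAGT"
counts = block_counts(database, "TCGATTAC", q=3, block_size=8)
t = 3  # threshold for w = 8, q = 3, e = 1
candidates = sorted(b for b, c in counts.items() if c >= t)
print(dict(counts), candidates)
```

Only blocks whose counter reaches t need to be passed on to the scan step, so the rest of the database is never aligned against the query.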

  9. The Algorithm • q-gram Filtration • Block Addressing • Suffix Array • Window Shifting • Precompute Searches for q-grams, O(1) Time Access • Sorted List of Pointers to Suffixes, O(log |D|) Access Time • q-gram table, first entries: AAA : 0, AAC : 0, AAG : 0, AAT : 0, ACA : 1, ACC : 1, ACG : 1, ACT : 1, AGA : 3, AGC : 3, AGG : 3, AGT : 3, ATA : 4, ATC : 4, ATG : 4, ATT : 5 • q-gram table, last entries: TGA : 26, TGC : 27, TGG : 27, TGT : 29, TTA : 29, TTC : 29, TTG : 30, TTT : 30 • (Figure: window TCGATTAC over database GCATTCGATGGACTGGACTAGTGAATCAGT with positions 0 to 29; numbers shown in the figure: 3 23 11 16)
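The idea of precomputing the suffix array search for every q-gram can be sketched like this. It is a toy illustration, not the authors' implementation: the suffix array is built by naively sorting suffixes, the table is a Python dictionary, and all function names are hypothetical. The resulting table is not claimed to reproduce the numbers shown on the slide.

```python
from bisect import bisect_left, bisect_right
from itertools import product


def build_suffix_array(text):
    # Naive construction by sorting suffixes; fine for a toy example only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_qgram_table(text, sa, q, alphabet="ACGT"):
    # Truncating sorted suffixes to q characters keeps them sorted, so
    # each q-gram occupies one contiguous interval of the suffix array.
    prefixes = [text[i:i + q] for i in sa]
    table = {}
    for letters in product(alphabet, repeat=q):
        gram = "".join(letters)
        table[gram] = (bisect_left(prefixes, gram),
                       bisect_right(prefixes, gram))
    return table


def occurrences(gram, sa, table):
    # One dictionary lookup replaces a binary search at query time.
    lo, hi = table[gram]
    return sorted(sa[lo:hi])


database = "GCATTCGATGGACTGGACTAGTGAATCAGT"
sa = build_suffix_array(database)
table = build_qgram_table(database, sa, q=3)
print(table["AAT"], occurrences("ACT", sa, table))
```

The binary searches, each O(log |D|), are paid once when the table is built. At query time every q-gram of the pattern costs a single table lookup plus the time to read off its occurrences.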

  10. The Algorithm • q-gram Filtration • Block Addressing • Suffix Array • Window Shifting • Move Window over Query • Mark full Blocks for each Window • Scan Marked Blocks • Example: q = 3, w = 8, e = 1, t = 3; Query TCGATTACAGTGAAT, Window TCGATTAC • (Figure: database GCATTCGATGGACTGGACTAGTGAATCAGT; block numbers and counters as extracted: 3 0 2 1 4 0 0 7)
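Window shifting can be sketched by updating the block counters incrementally: when the window moves one position to the right, the q-gram that enters is added and the q-gram that leaves is subtracted. The code below is a simplified illustration with assumptions that are not from the slides: a plain dictionary stands in for the suffix array index, blocks do not overlap, and the name `mark_blocks` is hypothetical.

```python
from collections import defaultdict


def mark_blocks(query, database, q, w, e, block_size):
    """Return the set of blocks whose counter reaches t for some window."""
    t = w - q + 1 - q * e
    # Stand-in index: q-gram -> block id of every occurrence in the database.
    index = defaultdict(list)
    for pos in range(len(database) - q + 1):
        index[database[pos:pos + q]].append(pos // block_size)

    per_window = w - q + 1  # number of q-grams in one window
    counts = defaultdict(int)
    marked = set()
    for i in range(len(query) - q + 1):
        for b in index.get(query[i:i + q], ()):   # q-gram entering
            counts[b] += 1
        if i >= per_window:                       # q-gram leaving
            old = query[i - per_window:i - per_window + q]
            for b in index.get(old, ()):
                counts[b] -= 1
        if i >= per_window - 1:                   # a full window is covered
            marked.update(b for b, c in counts.items() if c >= t)
    return marked


database = "GCATTCGATGGACTGGACTAGTGAATCAGT"
marked = mark_blocks("TCGATTACAGTGAAT", database, q=3, w=8, e=1, block_size=8)
print(sorted(marked))
```

With this blocking of the toy database, the first windows mark block 0 (matches around TCGAT) and the last windows mark block 2 (matches around AGTGAAT). Only the marked blocks would then be handed to the scan step.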

  11. Results • Influence of the Block Size • Sensitivity • Running Times • Overhead for Loading the Index • Benchmark System: UltraSPARC Processor, 333 MHz, 4 GB RAM

  12. Results Influence of Block Size

  13. Results Sensitivity • 1000 Queries • BLAST Cutoff E = 0.00001 • Number of identical hitlists • Mouse EST DB: 91.4 % • Human EST DB: 97.1 % • QUASAR finds many Hits below selected Error Level

  14. Results Running Times • Test Parameters: • 6% Error • w = 50 • q = 11 • block size 2048 • scan with BLAST • time averaged for 1000 queries • ~30 times faster than BLAST

  15. Results Overhead for Loading the Index • 1000 queries • Human EST DB, 280 Mbp • BLAST Test Run: • 5 seconds Load Time • 13,270 seconds Search Time • QUASAR Test Run: • 90 seconds Load Time • 380 seconds Search Time
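The effect of the index loading overhead on the overall speedup can be worked out from the numbers on this slide. The calculation below assumes that the BLAST search time is thirteen thousand two hundred seventy seconds, that is, that the separator in the figure groups thousands rather than marking decimals. That reading is an assumption, chosen because it is the one consistent with the speedup of roughly 30 reported on the previous slide.

```python
# Times in seconds for 1000 queries against the human EST database.
blast_load, blast_search = 5, 13270   # assumption: 13270 s, see lead-in
quasar_load, quasar_search = 90, 380

search_speedup = blast_search / quasar_search
total_speedup = (blast_load + blast_search) / (quasar_load + quasar_search)
load_share = quasar_load / (quasar_load + quasar_search)

print(f"search only: {search_speedup:.1f}x")
print(f"including load time: {total_speedup:.1f}x")
print(f"index loading share of QUASAR run: {load_share:.0%}")
```

Under that reading, loading the index takes a noticeable share of a 1000-query QUASAR run, and the advantage over BLAST grows with the number of queries because the load time is paid only once.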
