
Variations of Forward-SBNDM

Variations of Forward-SBNDM. Hannu Peltola, Jorma Tarhio, Aalto University, Finland. Aims: tuning algorithms for exact string matching; studying the effect of simultaneous 2-byte read. SBNDM: Simple Backward Nondeterministic DAWG Matching.


Presentation Transcript


  1. Variations of Forward-SBNDM • Hannu Peltola, Jorma Tarhio • Aalto University, Finland

  2. Aims • Tuning algorithms for exact string matching. • Studying the effect of simultaneous 2-byte read. Aug. 29, 2011

  3. SBNDM: Simple Backward Nondeterministic DAWG Matching • SBNDM [18] is a simplification of BNDM [17]. Both are bit-parallel algorithms. • Text T = t_1...t_n, pattern P = p_1...p_m. • At each alignment window of P in T, scan T from right to left until the suffix of the window is not a factor of P or an occurrence of P is found.

  4. Shift of SBNDM • No factor: m • P found: 1 • Else: next alignment starts at the last factor

  5. SBNDM, example • P = banana, T = antanabadbanana... • Alignment (window in brackets): [antana]badbanana • Factors read from right to left: a, na, ana

  6. SBNDM, example • P = banana, T = antanabadbanana... • Alignment (window in brackets): [antana]badbanana • Factors read from right to left: a, na, ana • Not a factor: tana • Next alignment: ant[anabad]banana

  7. SBNDM, example • P = banana, T = antanabadbanana... • Alignment (window in brackets): [antana]badbanana • Factors read from right to left: a, na, ana • Not a factor: tana • Next alignment: ant[anabad]banana • Not a factor: d • Next alignment: antanabad[banana]
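
The example above can be traced with a short Python sketch. It is not the authors' implementation: it follows the scanning and shift rules as stated on slides 3 and 4 (including the shift of 1 after an occurrence), and the function name `sbndm` and the returned list of window positions are assumptions made for illustration.

```python
def sbndm(pattern, text):
    """Return (occurrence positions, alignment start positions), 0-based."""
    m = len(pattern)
    mask = (1 << m) - 1
    # B[c] has bit (m - 1 - j) set when pattern[j] == c.
    B = {}
    for j, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - j))
    hits, windows = [], []
    pos = 0                               # start of the current alignment window
    while pos + m <= len(text):
        windows.append(pos)
        D = mask                          # all pattern positions are candidates
        j = m                             # text[pos + j:pos + m] is the factor read so far
        while j > 0:
            E = D & B.get(text[pos + j - 1], 0)
            if E == 0:                    # the longer suffix is not a factor of the pattern
                break
            j -= 1
            D = (E << 1) & mask
        if j == 0:                        # whole window is a factor of length m: occurrence
            hits.append(pos)
            pos += 1
        else:                             # next alignment starts at the last factor
            pos += j                      # j == m when even the last character is no factor
    return hits, windows

print(sbndm("banana", "antanabadbanana"))  # ([9], [0, 3, 9])
```

The three window positions 0, 3 and 9 correspond to the three alignments shown on the slide.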

  8. SBNDMq • SBNDMq [6] is a tuned version of SBNDM. • Processing of an alignment starts with checking a q-gram. • Let q = 4. Consider an alignment at antana. Instead of testing four suffixes a, na, ana, tana, only tana is tested. • Testing is done in a fast loop.
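
As a rough illustration of the q-gram test, the sketch below combines the occurrence vectors of all q characters into one state vector, so that a single comparison with zero replaces the q separate suffix tests. The helper names `occurrence_vectors` and `qgram_state` are hypothetical, and the bit layout is an assumption chosen to agree with the vectors shown later in the deck.

```python
def occurrence_vectors(pattern):
    # B[c] has bit (m - 1 - j) set when pattern[j] == c.
    m = len(pattern)
    B = {}
    for j, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - j))
    return B

def qgram_state(pattern, gram):
    # The k-th character of the q-gram (counted from the left) is shifted
    # left by k bits before the AND; a zero result means "not a factor".
    B = occurrence_vectors(pattern)
    D = B.get(gram[0], 0)
    for k in range(1, len(gram)):
        D &= B.get(gram[k], 0) << k
    return D

print(format(qgram_state("banana", "tana"), "06b"))  # 000000
print(format(qgram_state("banana", "nana"), "06b"))  # 001000
```

For the alignment at antana the state vector of tana is zero, so the window can be skipped after one test.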

  9. Forward-SBNDM • Forward-SBNDM (FSB for short) by Faro & Lecroq [7] is a lookahead version of SBNDM2. • Both FSB and SBNDM2 read a 2-gram x1x2 before a factor test. • x1x2 is matched with the end of P in SBNDM2. • Only x1 is matched with the end of P in FSB, and x2 is a lookahead character following the current alignment. • FSB is faster than SBNDM2 for large alphabets.

  10. Generalization of FSB: FSB(q,f) • FSB(q,f) (= Forward-SBNDM(q,f)) is SBNDMq with f lookahead characters, f = 0, 1, ..., q-1. • FSB(2,1) = FSB and FSB(q,0) = SBNDMq. • Motivation: SBNDMq works well on modern processors also for q>2.

  11. FSB(q,f) • Let UV be a q-gram, where |V| = f. • After reading UV there are 3 alternatives: • If U is a suffix of P, reading continues leftwards. • Else if UV is a factor of P, reading continues leftwards. • Else the state vector is zero and P is shifted m-q+f+1 positions (f positions more than in SBNDMq).
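
The three alternatives above can be put together into a complete search loop. The following Python sketch is one possible reading of FSB(q,f), not the authors' code: the function name `fsb_search`, the padding of the text with f sentinel characters so that the last windows can still read lookahead characters, and the shift of 1 after an occurrence are all assumptions.

```python
def fsb_search(pattern, text, q=4, f=1):
    """Return the 0-based start positions of pattern in text."""
    m, n = len(pattern), len(text)
    if not (0 <= f < q <= m):
        raise ValueError("need 0 <= f < q <= m")
    width = m + f                          # pattern bits plus f extra bits
    mask = (1 << width) - 1
    extra = (1 << f) - 1                   # the f lowest bits accept any lookahead character
    B = {}
    for j, c in enumerate(pattern):
        B[c] = B.get(c, extra) | (1 << (width - 1 - j))
    padded = text + "\0" * f               # assumption: sentinels serve as lookahead at the end
    hits = []
    pos = 0
    while pos + m <= n:
        j = pos + width - q                # leftmost character of the q-gram UV
        D = B.get(padded[j], extra)
        for k in range(1, q):
            D &= B.get(padded[j + k], extra) << k
        D &= mask
        while D and j > pos:               # suffix or factor: continue leftwards
            j -= 1
            D = (D << 1) & B.get(padded[j], extra) & mask
        if D:                              # the whole window was read: occurrence
            hits.append(pos)
            pos += 1
        else:                              # zero state vector: the next window starts after
            pos = j + 1                    # the failing character; m-q+f+1 if the q-gram failed
    return hits

print(fsb_search("banana", "antanabadbanana", q=4, f=2))  # [9]
```

With f = 0 the same loop behaves like SBNDMq, which matches the identity FSB(q,0) = SBNDMq.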

  12. Occurrence vectors in FSB(q,2) • Example: P = banana • SBNDMq: B[n] = 00001010 • FSB(q,2): B[n] = 00101011, B[a] = 01010111, B[x] = 00000011 (the two rightmost bits are the extra bits)
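
The vectors on this slide can be reproduced with a few lines of Python. This is a minimal sketch under the assumption that the FSB(q,f) vectors are the SBNDMq vectors shifted left by f bits with the f lowest bits set; the function name `occurrence_vectors` is hypothetical.

```python
def occurrence_vectors(pattern, f=0):
    """Return (B, default); default is the vector of characters not in pattern."""
    m = len(pattern)
    extra = (1 << f) - 1                   # f extra bits, all set
    B = {}
    for j, c in enumerate(pattern):
        B[c] = B.get(c, extra) | (1 << (m - 1 - j + f))
    return B, extra

plain, _ = occurrence_vectors("banana")
fsb, default = occurrence_vectors("banana", f=2)
print(format(plain["n"], "08b"))   # 00001010
print(format(fsb["n"], "08b"))     # 00101011
print(format(fsb["a"], "08b"))     # 01010111
print(format(default, "08b"))      # 00000011
```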

  13. State vectors in FSB(q,2) for q=4 • 4-gram nanx (each occurrence vector is shifted according to the position of its character before the AND):
        x  00000011
        n  00101011
        a  01010111
        n  00101011
           00001000  (state vector)
      4-gram  State vector  Conclusion
      nanx    00001000      na is a suffix of P (nanx is not a factor)
      xana    00000000      not a factor
      anan    01000000      factor of P
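
The first two rows of the table can be checked with the sketch below. It assumes the same vector layout as the occurrence vectors of the previous slide and shifts the k-th character of the 4-gram (counted from the left) by k bits; `fsb_vectors` and `state_vector` are hypothetical names.

```python
def fsb_vectors(pattern, f):
    m = len(pattern)
    extra = (1 << f) - 1                   # vector of characters not in the pattern
    B = {}
    for j, c in enumerate(pattern):
        B[c] = B.get(c, extra) | (1 << (m - 1 - j + f))
    return B, extra

def state_vector(pattern, gram, f):
    B, extra = fsb_vectors(pattern, f)
    D = (1 << (len(pattern) + f)) - 1      # start with all m + f bits set
    for k, c in enumerate(gram):
        D &= B.get(c, extra) << k          # k-th character shifted left by k bits
    return D

print(format(state_vector("banana", "nanx", 2), "08b"))  # 00001000
print(format(state_vector("banana", "xana", 2), "08b"))  # 00000000
```

The surviving bit for nanx comes from na matching the end of P while n and x fall on the two extra bits, which is why the 4-gram is accepted although it is not a factor.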

  14. Benefits / drawbacks of lookahead characters and extra bits • Benefits: longer shifts => more speed; combined suffix/factor test • Drawback: more q-grams accepted => less speed

  15. Greedy skip loop for SBNDM2 (GSB2 = Greedy-SBNDM2) • Factor tests of two 2-grams are done in one round. • Let B2[x,y] denote the combined occurrence vector of characters x and y. • B2[x,y] = B[x] & (B[y]<<1)
        next: D ← B2[t_i, t_{i+1}]
              if D = 0 then
                  if B2[t_{i+m-1}, t_{i+m}] = 0 then
                      i ← i + 2*m - 2
                      goto next
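
The slide shows only the skip-loop fragment. A possible completion in Python is sketched below; the handling of the case where the second 2-gram is a factor, the leftward scan, and the shift of 1 after an occurrence are assumptions, as are the name `gsb2_search` and the use of a dictionary for B2.

```python
def gsb2_search(pattern, text):
    """Return the 0-based start positions of pattern in text (needs m >= 2)."""
    m, n = len(pattern), len(text)
    if m < 2:
        raise ValueError("pattern must have at least 2 characters")
    mask = (1 << m) - 1
    B = {}
    for j, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - j))
    # Combined vectors; pairs with a character outside the pattern are zero.
    B2 = {(x, y): B[x] & (B[y] << 1) & mask for x in B for y in B}
    hits = []
    i = m - 2                              # left character of the 2-gram ending the window
    while i + 1 < n:
        D = B2.get((text[i], text[i + 1]), 0)
        if D == 0:
            nxt = i + m - 1                # 2-gram of the next alignment
            if nxt + 1 < n and B2.get((text[nxt], text[nxt + 1]), 0) == 0:
                i += 2 * m - 2             # both 2-grams fail: double shift
            else:
                i = nxt                    # second 2-gram is a factor (or text ends)
            continue
        start = i + 2 - m                  # first character of the window
        j = i
        while D and j > start:             # scan leftwards as in SBNDM2
            j -= 1
            D = (D << 1) & B.get(text[j], 0) & mask
        if D:
            hits.append(start)
            i += 1
        else:
            i = j + m - 1                  # next window starts right after text[j]
    return hits

print(gsb2_search("banana", "antanabadbanana"))  # [9]
```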

  16. 2-byte read • Read two characters (= 2 bytes = 16 bits) in one instruction (in a skip loop). • Well suited to q-gram algorithms with even q. • For experiments we made two versions of the algorithms: standard (1-byte read) and a b-version using 2-byte read.

  17. 2-byte read (cont.) • Advantage: a part of the computation can be moved to the preprocessing phase. • Example: B2[x,y] = B[x] & (B[y]<<1) • The speed-up factor can even exceed 2. • Drawback: extra 0.1 ms for preprocessing.
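
The idea of moving B2 into preprocessing can be illustrated as follows. Python has no raw 16-bit load, so the sketch emulates it with `int.from_bytes`; the little-endian byte order, the 65536-entry list, and the names `build_b2_table` and `read16` are assumptions for illustration only.

```python
def build_b2_table(pattern):
    """Precompute B2 for every 16-bit value formed by two adjacent text bytes."""
    m = len(pattern)
    mask = (1 << m) - 1
    B = [0] * 256
    for j, c in enumerate(pattern):        # iterating over bytes yields integers
        B[c] |= 1 << (m - 1 - j)
    table = [0] * 65536
    for x in range(256):
        if B[x] == 0:
            continue                       # B2[x, y] is zero for every y
        for y in range(256):
            # Little-endian: the first byte in memory is the low byte.
            table[x | (y << 8)] = B[x] & (B[y] << 1) & mask
    return table

def read16(text, i):
    # Emulates a single 2-byte read of text[i] and text[i + 1].
    return int.from_bytes(text[i:i + 2], "little")

table = build_b2_table(b"banana")
print(format(table[read16(b"antana", 4)], "06b"))  # 001010
print(format(table[read16(b"banad", 3)], "06b"))   # 000000
```

In the skip loop a single table lookup then replaces two lookups in B, a shift and an AND.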

  18. 4-byte read? • Many border crossings happen => slowdown • Tables with 2^32 entries are too big in practice

  19. Experimental results/KJV Bible • In the recent comparison by S. Faro and T. Lecroq, "The Exact String Matching Problem: a Comprehensive Experimental Evaluation" (2010), the algorithms EBOM and Hash3 were the fastest on the Bible text for m = 4,...,20.

  20. KJV: EBOM & Hash3 (on ThinkPad X61s)

  21. KJV: EBOMb & Hash3b (with 2-byte read) added

  22. KJV: SBNDM2b = FSB(2,0)b added

  23. KJV: GSB2b added

  24. KJV: FSB(4,i)b added, i = 0,1,2

  25. KJV: Speed-up factors of 2-byte read
        Algorithm  Speed-up
        GSB2       1.32
        FSB(2,0)   1.34
        FSB(2,1)   1.24
        FSB(4,0)   1.72
        FSB(4,1)   2.15
        FSB(4,2)   2.03
        Hash3      1.05
        EBOM       1.17

  26. Other experiments • DNA and binary data were also tested. • The gain of lookahead characters or the greedy loop was smaller than with the Bible data. • The gain of 2-byte read was smaller with 64-bit code than with 32-bit code.

  27. Conclusions • Two new algorithms were presented: FSB(q,f) and GSB2. • The new algorithms are faster than earlier algorithms on English data: GSB2 for m = 4, …, 8 and FSB(q,f) for m = 8, …, 20. • 2-byte read makes most string algorithms faster.

  28. Web site for practical speed comparison • cse.aalto.fi/stringmatching
