This work by Ayat A. Dawood and Mohamed AbouelHoda presents advancements in enhanced suffix arrays, focusing on exact pattern matching and improved bucketing representation through minimal perfect hashing techniques. Our approach reduces space complexity while achieving constant access time for long common prefixes. These modifications lead to a significant performance increase for various applications, including genomic sequence analysis, effectively addressing the challenges posed by large datasets. We discuss the construction of suffix arrays and l-intervals, alongside the implementation details of our enhancements.
Fine Tuning the Enhanced Suffix Arrays • Ayat A. Dawood, CIS, Nile University • Joint work with: Mohamed AbouelHoda
Table of Contents • Suffix array • The enhanced suffix array • Our accomplishment: • Minimal perfect hashing function • The exact pattern matching problem • Improving the bucket table representation
Suffix array • An array of integers in the range 0 to n, specifying the lexicographic order of the (n+1) suffixes of S$. • e.g., S$ = acaaacatat$ (n = 10)
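The definition above can be tried out with a minimal sketch. It sorts the suffixes directly, which is far slower than a real construction algorithm and is only meant to show what the table contains. The function name `suffix_array` is hypothetical, and the code assumes that `$` sorts after every other character, which is what the interval boundaries on the later slides imply.

```python
def suffix_array(text):
    """Return the suffix array of text, which must end in the sentinel '$'."""
    # Assumption: '$' is the largest character. '~' sorts after all
    # lowercase letters in ASCII, so it stands in for '$' while sorting.
    def sort_key(start):
        return text[start:].replace("$", "~")

    return sorted(range(len(text)), key=sort_key)


suftab = suffix_array("acaaacatat$")
print(suftab)  # [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
```

Index 10 (the suffix `$`) ends up last, so the whole array is the interval [0..10].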
Enhanced suffix array • The suffix array (suftab) enhanced with a set of additional tables. • With these tables, algorithms on the array reach the time complexity of their suffix-tree counterparts. • lcptab[i] stores the length of the longest common prefix of the suffixes starting at suftab[i-1] and suftab[i], with lcptab[0] = 0.
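A small sketch of the lcp-table definition, assuming the suffix array of the running example is already known. The helper names are hypothetical, and the character-by-character comparison is chosen for clarity rather than speed.

```python
def lcp_table(text, suftab):
    """lcptab[i] = longest common prefix length of suffixes suftab[i-1], suftab[i]."""
    def common_prefix_len(a, b):
        length = 0
        while (a + length < len(text) and b + length < len(text)
               and text[a + length] == text[b + length]):
            length += 1
        return length

    # lcptab[0] has no predecessor and is set to 0.
    return [0] + [common_prefix_len(suftab[i - 1], suftab[i])
                  for i in range(1, len(suftab))]


text = "acaaacatat$"
suftab = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]  # suffix array of the example
print(lcp_table(text, suftab))  # [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]
```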
Enhanced suffix array: l-interval • l-interval: an interval [i..j] of the suffix array whose suffixes share a longest common prefix of length l, written l-[i..j]. • Interval tree for acaaacatat$: 0-[0..10] has children a: 1-[0..5], c: 2-[6..7], t: 1-[8..9]; 1-[0..5] has children a: 2-[0..1], c: 3-[2..3], t: 2-[4..5].
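The intervals of the example can be recovered from the lcp-table alone. The sketch below uses the usual stack-based left-to-right scan; the function name `lcp_intervals` and the tuple layout `(l, left, right)` are assumptions made for illustration.

```python
def lcp_intervals(lcptab):
    """Return every l-interval as (l, left, right), children before parents."""
    n = len(lcptab)
    found = []
    stack = [(0, 0)]  # (lcp value, left boundary); starts with the root
    for i in range(1, n + 1):
        # A final value of -1 forces every open interval to be closed.
        current = lcptab[i] if i < n else -1
        left = i - 1
        while stack and current < stack[-1][0]:
            l, left = stack.pop()
            found.append((l, left, i - 1))
        if not stack or current > stack[-1][0]:
            stack.append((current, left))
    return found


lcptab = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]  # lcp-table of acaaacatat$
for l, left, right in lcp_intervals(lcptab):
    print(f"{l}-[{left}..{right}]")
```

For the example this lists 2-[0..1], 3-[2..3], 2-[4..5], 1-[0..5], 2-[6..7], 1-[8..9] and finally the root 0-[0..10].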
Our accomplishment • Improvement (fine tuning): • Alphabet-independent exact pattern matching. • Improving the bucket table representation. • Improving access to the lcp-table. • The improvements are achieved using minimal perfect hashing techniques.
Minimal perfect hashing (MPHF) • Stores n static keys from a universe U in O(n) space with O(1) access time [Botelho et al.]. • By comparison, a direct lookup table needs O(|U|) space to achieve constant access time.
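To make the idea concrete, here is a toy minimal perfect hash built with a simple hash-and-displace scheme. It is not the construction of Botelho et al. and is not tuned for space or build time; it only illustrates that n static keys can be mapped onto the slots 0..n-1 without collisions. All function names are hypothetical, and the keys are assumed to be distinct.

```python
import hashlib


def _hash(key, seed, size):
    """Seeded hash of key into the range 0..size-1."""
    digest = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % size


def build_mphf(keys):
    """Return one seed per bucket so that all keys land in distinct slots."""
    n = len(keys)
    buckets = [[] for _ in range(n)]
    for key in keys:
        buckets[_hash(key, 0, n)].append(key)
    seeds = [0] * n
    taken = [False] * n
    # Place the largest buckets first; they are the hardest to fit.
    for b in sorted(range(n), key=lambda b: -len(buckets[b])):
        if not buckets[b]:
            continue
        seed = 1
        while True:
            slots = [_hash(key, seed, n) for key in buckets[b]]
            if len(set(slots)) == len(slots) and not any(taken[s] for s in slots):
                break
            seed += 1
        seeds[b] = seed
        for s in slots:
            taken[s] = True
    return seeds


def mphf(key, seeds):
    """Slot of key in 0..n-1; only meaningful for keys used at build time."""
    n = len(seeds)
    return _hash(key, seeds[_hash(key, 0, n)], n)


keys = ["aa", "ac", "at", "ca", "ta"]
seeds = build_mphf(keys)
print({key: mphf(key, seeds) for key in keys})
```

An MPHF returns some slot even for a key that was never inserted, so a table built on it normally stores the key (or a fingerprint) in the slot and checks it on lookup.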
Exact pattern matching problem • e.g., pattern = aca, located by descending the interval tree: 0-[0..10] → a → 1-[0..5] → c → 3-[2..3]
Exact pattern matching problem • Naive method: O(nm) time for a text of length n and a pattern of length m. • Using the enhanced suffix array: O(|Σ|m) [AbouelHoda et al.]. • Other modifications of the enhanced suffix array achieve O(m log |Σ|) [Kim et al.], [Fischer et al.].
Exact pattern matching problem • Our work: • Using a minimal perfect hash function, matching takes O(m) time, removing the alphabet factor. • Figure: the interval tree with an MPHF table attached to its branching intervals (e.g., 0-[0..10] and 1-[0..5]), used to select the child interval for the next pattern character.
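The following sketch shows the top-down search with one lookup table per branching interval. A Python `dict` stands in for the MPHF table, so this illustrates the access pattern rather than the authors' data layout; the function names and the `tables` structure are assumptions. The suffix array and lcp-table of the running example are written out as constants.

```python
TEXT = "acaaacatat$"
SUFTAB = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
LCPTAB = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]


def build_child_tables(text, suftab, lcptab):
    """Map each branching interval (i, j) to (l, {edge character: child})."""
    tables = {}

    def visit(i, j):
        if i == j:  # a single suffix has no children
            return
        l = min(lcptab[i + 1:j + 1])
        # Child intervals are separated where the lcp value equals l.
        bounds = [i] + [k for k in range(i + 1, j + 1) if lcptab[k] == l] + [j + 1]
        table = {}
        for a, b in zip(bounds, bounds[1:]):
            table[text[suftab[a] + l]] = (a, b - 1)
            visit(a, b - 1)
        tables[(i, j)] = (l, table)

    visit(0, len(suftab) - 1)
    return tables


def find(pattern, text, suftab, tables):
    """Return the sorted start positions of pattern in text."""
    m = len(pattern)
    i, j, depth = 0, len(suftab) - 1, 0
    while True:
        # l = length of the prefix shared by all suffixes in [i..j]
        l = tables[(i, j)][0] if i < j else len(text) - suftab[i]
        upto = min(l, m)
        start = suftab[i]
        if text[start + depth:start + upto] != pattern[depth:upto]:
            return []
        depth = upto
        if depth == m:
            return sorted(suftab[i:j + 1])
        if i == j:
            return []  # the only remaining suffix is shorter than the pattern
        child = tables[(i, j)][1].get(pattern[depth])  # one table lookup
        if child is None:
            return []
        i, j = child


tables = build_child_tables(TEXT, SUFTAB, LCPTAB)
print(find("aca", TEXT, SUFTAB, tables))  # [0, 4]
```

Each pattern character is compared once or twice and each level costs one table lookup, so the search does not scan the alphabet at any node.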
Improving the bucket table representation • Bucket table: a table used to jump directly to a certain lcp-interval in the suffix array
Improving the bucket table representation (cont.) • Problem: • The space consumption of the lookup table, |Σ|^d entries, is prohibitive for large d and Σ. • Solution: • Use minimal perfect hashing to store the lookup table.
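A rough sketch of the space argument: instead of reserving one slot for each of the |Σ|^d possible d-mers, only the d-mers that actually occur are stored. A `dict` stands in for the MPHF-backed table here, and `bucket_table` is a hypothetical name; the slides do not describe the actual layout.

```python
def bucket_table(text, suftab, d):
    """Map every d-mer occurring in text to its suffix-array interval."""
    buckets = {}
    for index, start in enumerate(suftab):
        prefix = text[start:start + d]
        if len(prefix) < d or "$" in prefix:
            continue  # this suffix is too short to start with a full d-mer
        left, _ = buckets.get(prefix, (index, index))
        buckets[prefix] = (left, index)
    return buckets


text = "acaaacatat$"
suftab = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
buckets = bucket_table(text, suftab, d=2)
print(buckets)
# {'aa': (0, 1), 'ac': (2, 3), 'at': (4, 5), 'ca': (6, 7), 'ta': (8, 8)}

alphabet_size = len(set(text) - {"$"})
print(alphabet_size ** 2, "direct slots versus", len(buckets), "stored d-mers")
# 9 direct slots versus 5 stored d-mers
```

The gap widens quickly with d, since the direct table grows as |Σ|^d while the number of distinct d-mers in a text of length n is at most n.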
Improving the bucket table representation (cont.) • Results: • For the bacterial E. coli genome (size = 5400 bp) and d = 12. • * N stands for an undefined nucleotide or dummy character.
Conclusion • Exact pattern matching problem • Improving the bucket table representation • Improving access to the lcp-table
Questions?
Improving access to the lcp-table • To reduce space, each lcp-table entry is stored in 1 byte. • If a common prefix is longer than 255, its length is stored in a separate (extra) table. • The extra table is accessed sequentially or by binary search. • Our enhancement: • Use an MPHF to store the extra table so that it is accessed in constant time. • Example (figure): lcp values 0, 2, 257, 279, 3, 300, 2, 260, 0; the values that do not fit in one byte (257, 279, 300, 260) go to the extra lcp-table.
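The two-level scheme can be sketched as follows, using the values from the figure. The code assumes that the byte value 255 is reserved as an overflow marker, so any value of 255 or more goes to the extra table. A `dict` keyed by position stands in for the MPHF-backed extra table, and the function names are hypothetical.

```python
OVERFLOW = 255  # assumed marker: the real value lives in the extra table


def compress_lcp(lcp_values):
    """Split lcp values into a 1-byte table and an extra overflow table."""
    small = bytearray()
    extra = {}
    for index, value in enumerate(lcp_values):
        if value >= OVERFLOW:
            small.append(OVERFLOW)
            extra[index] = value
        else:
            small.append(value)
    return small, extra


def lcp_at(index, small, extra):
    """Constant-time access: one byte read plus at most one table lookup."""
    value = small[index]
    return extra[index] if value == OVERFLOW else value


values = [0, 2, 257, 279, 3, 300, 2, 260, 0]
small, extra = compress_lcp(values)
print(list(small))  # [0, 2, 255, 255, 3, 255, 2, 255, 0]
print(extra)        # {2: 257, 3: 279, 5: 300, 7: 260}
print([lcp_at(i, small, extra) for i in range(len(values))] == values)  # True
```

With a sorted list of (position, value) pairs the overflow lookup would need a binary search; keying the extra table by position through a hash removes that logarithmic factor.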