240 likes | 436 Vues
An Improved Algorithm to Accelerate Regular Expression Evaluation. Michela Becchi and Patrick Crowley ANCS 2007. Context. Regular expression matching is a critical operation in networking Intrusion detection Context based billing Peer-to-peer traffic detection and prioritization
E N D
An Improved Algorithm to Accelerate Regular Expression Evaluation Michela Becchi and Patrick Crowley ANCS 2007
Context • Regular expression matching is a critical operation in networking • Intrusion detection • Context based billing • Peer-to-peer traffic detection and prioritization • Application level filtering • Challenge: perform regular expression matching at line rate, given data-sets of hundreds (or thousands) of patterns • Processing time • Memory requirement (occupancy and bandwidth)
Background & Problem definition • Two algorithmic solutions • Non deterministic finite automata (NFAs) • High time complexity/memory bandwidth requirements • Compact representation • Deterministic finite automata (DFAs) • Low time complexity • Potentially high storage requirement • Multiple implementation approaches • FPGA [Sidhu 2001, Clark 2003, Moscola 2003] • Software [Paxson 1998, Roesch 1999, Tuck 2004] • Custom hardware [Kumar 2006] • Problem: given a DFA, find a representation • compact • allowing an acceptable bound of memory bandwidth requirement/processing time
Background - D2FA s3 s3 a a b s4 b s4 • Observation: • DFAs from practical datasets have redundancy in state transitions • Idea: • default transitions: non-consuming transitions s1 s1 c c s5 s5 s3 a b s4 s2 c s6 s2 c s6 • Implication: • Traversal time / memory bandwidth requirement dependent upon maximum default path length
Background – D2FA construction RegEx: ab+c+, cd+ and bd+e from 1-8 b Space reduction graph a b 3 2 1 2 1 c 3 4 Remaining transitions a 4 c 3 5 c 3 8 3 c 0 d 4 4 4 c d d 0 4 5 4 c 4 d 4 c 7 4 c 4 c 8 4 e b 4 5 4 d 6 7 5 6 4 b d from 3-8
from 1-8 b a b 1 2 c Remaining transitions a c 3 c c d c d d 0 4 5 c d c c c 8 e b d 6 7 b d from 3-8 Background – D2FA construction RegEx: ab+c+, cd+ and bd+e Space reduction graph [4] 3 [4] 1 2 4 3 4 [3] [4] 5 3 8 3 [3] 0 4 4 4 4 4 4 7 4 [2] 4 [4] 4 4 5 4 5 6 [3] 4 [3] Diameter bound=4 removed transitions=33
from 1-8 b a b 1 2 c Remaining transitions a c 3 c c d c d d 0 4 5 c d c c c 8 e b d 6 7 b d from 3-8 Background – D2FA construction RegEx: ab+c+, cd+ and bd+e D2FA Space reduction graph b [4] b 3 [4] 1 2 1 2 4 3 c 4 [3] 3 [4] 5 d a 3 8 3 [3] c 0 4 e 4 d 4 d 4 0 5 4 4 d 4 7 4 b [2] 4 8 [4] 4 e 4 5 4 d 6 7 5 6 [3] 4 [3] Diameter bound=4 removed transitions=33
Background – D2FA construction RegEx: ab+c+, cd+ and bd+e D2FA Space reduction graph from 1-8 b b [4] a b b 3 [4] 1 2 1 2 1 2 c 4 3 Remaining transitions a c 4 [3] 3 [4] c 5 d a 3 3 c 8 3 [3] c 0 c d 4 e 4 d 4 d c d 4 0 5 d 4 0 4 5 4 d c 4 7 4 d b [2] c 4 c 8 4 e c 8 e 4 b 5 4 d 6 7 d 5 6 7 6 [3] 4 [3] b Diameter bound=4 removed transitions=33 d from 3-8 [2] [2] 3 1 2 3 4 2 [2] [2] 4 3 8 5 3 Traversal time=O((D/2+1)N) Time complexity=O(n2logn) Space complexity=O(n2) 0 4 4 4 [1] 4 3 4 3 [2] 7 [1] 4 4 4 4 5 4 4 [2] 5 [2] 6 4 Diameter bound=2 removed transitions=28
Transition redundancy: why? RegEx: ab+c+, cd+ and bd+e from 1-8 b a b 1 2 c Remaining transitions a c 3 c c d c d d 0 4 5 c d c c c 8 e b 6 7 • Forward transitions: • Matches • State specific • Backward transitions: • Mismatch • Shared by multiple states d b from 3-8 d • Idea: • Introduce state depth: minimum distance from entry state • Orient default transitions only backwards (towards decreasing depth) • Pros: • Traversal time O(2N)independent of the maximum default path length • Generality: no need of diameter bound parameter • Cons: • Possible compression loss
from 1-8 b 3 a 1 2 b b 4 3 1 2 c 4 b 2 1 Remaining transitions a 3 8 5 3 c 3 c 0 4 c 4 4 d 4 c 3 a d c d 4 c d 0 4 5 e d 7 4 4 c d 4 d 0 4 5 c c 4 4 d c 5 8 e b b 5 8 6 d 4 e 6 7 d b 6 7 d from 3-8 Our scheme RegEx: ab+c+, cd+ and bd+e Reduced graph Oriented space reduction graph [2] [1] [3] [1] [2] [0] [3] [1] [2] NO diameter bound Time complexity=O(2N) removed transitions=33 • Observations: • Maximum spanning tree on oriented graph: Edmonds and Chu solutions • 2 steps: edge selection and cycle resolution • No cycles: • Space reduction graph not necessary • Simple breath-first traversal algorithm Traversal time=O(2N) Time complexity=O(n2) Space complexity=O(n)
Discussion • Generalization (Jon Turner’s observation) • Allowing default transitions only from depth d to depth ≤d-k, w/ k1, leads to worst case traversal time • Time and space complexity of the construction algorithm still O(n2) and O(n) • Examples: • k=1 traversal O(2N) • k=2 traversal O(1.5N) • k=3 traversal O(1.33N) • k=4 traversal O(1.25N) • Compression • D2FA: • Constraint: diameter bound • Heuristic • Our algorithm: • Constraint: orientation (may be not a problem for RegEx originated DFAs) • Optimal solution
Discussion (cont’d) • Default transitions and depth computation through breath-first traversal • Default transitions can be computed during subset construction, that is, at DFA creation time. • D2FA space complexity can be an issue for big DFAs • O(n2): space reduction graph • Using adjacency list 17B/edge struct wgedge { vertex l,r; // endpoints of the edge weight wt; // edge weight edge lnext; //link to next edge incident to l edge rnext; //link to next edge incident to r } *edges; • Fully connected graph w/ ~11K nodes will require 1GB storage • Possible solutions: partial graphs based on weight • Multiple scans • Effect on algorithm’s execution time
Traversal locality DFA traversal exhibits locality Average traffic tends to mismatch States at low depths tend to be traversed more Backward default transition reiterate the traversal of likely states Discussion (cont’d)
Alphabet reduction • Observation: • Some symbols are treated in the same way over the whole DFA [(s,ci)= (s,cj) for each state sє DFA] • Example: • Ignore case • \n, \r • unused characters • Idea: • Group characters into classes • Mapping filter • Algorithm: • Sequence of clustering operations • Breath-first traversal w/ O(n2) complexity • Applicable at DFA creation time
Evaluation – number of transitions x16 x4 x23 x9.5 x8 x8 x35
Conclusion • DFA exhibit transition redundancy exploitable through default transitions • D2FA: algorithm trading off compression w/ traversal time • In this work, we propose generic algorithm: • With limited time and space complexity • Allowing O(2N) traversal time (or less) when processing input text • Leading to compression level similar to (or better than) D2FA
Thank you! Questions?
Our algorithm procedure default_transition (DFAdfa=(n, (states, )), modifies setdefault); listqueue; setdepth[n]; for state sstates depth[s]=n; default[s]=s; rof depth[0]=0;queue.push(0); while (!queue.empty()) state s= queue.pop(); int saving=0; for char c if (depth[(s,c)]=n) depth[(s,c)]= depth[s]+1; queue.push((s,c)); fi rof; for (state tstates & depth[t]<depth[s]) intcommon:=# common transitions btw. s and t; if (common > 1 && (common>saving || (common=saving && depth[t]<depth[default[s]]))) saving:=common; default[s]=t; fi rof; end while; end;